Amiga Plus 1995 #2


Text File | 1995-04-11 | 141.2 KB | 3,188 lines

Archive-name: comp-speech-faq/part1
Last-modified: 1995/01/11

COMP.SPEECH FAQ POSTING - PART 1/3

[Note: this document has been automatically extracted from a WWW site:
http://www.speech.su.oz.au/comp.speech
This may introduce some formatting errors.]

Comp.Speech Frequently Asked Questions

The Frequently Asked Questions (FAQ) is a regular posting to comp.speech which attempts to answer some of the regular questions in the comp.speech newsgroup.

The FAQ is not meant to discuss any topic exhaustively. It will hopefully provide readers with pointers on where to find useful information, especially material available on the Internet.

If you have not already read the Usenet introductory material posted to "news.announce.newusers", please do. For help with FTP (file transfer protocol) look for a regular posting of "Anonymous FTP List - FAQ" in comp.misc, comp.archives.admin or news.answers.

This FAQ is posted every 4 weeks to comp.speech, comp.answers & news.answers. It is also available for anonymous ftp from the comp.speech archive site:
  * ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/FAQ-complete

Or from the news.answers ftp site (and its mirrors):
  * ftp://rtfm.mit.edu/pub/usenet/news.answers/comp-speech-faq/*

Or on the World Wide Web:
  * http://www.speech.su.oz.au/comp.speech

Or by sending email to mail-server@rtfm.mit.edu with the following line in the body of the message:
  * send usenet/news.answers/comp-speech-faq/*

Admin

Not much to report this month. Hopefully, February should see some major catch-up work.
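The mail-server request described above (a single "send" line in the body of a message to mail-server@rtfm.mit.edu) could be composed along the following lines. This is only a minimal sketch: the function name build_faq_request, the Subject text and the sender address are illustrative assumptions, not something the FAQ specifies, and the message is only built, not sent.

```python
from email.message import EmailMessage

# Address and request line are taken from the FAQ text above.
MAIL_SERVER = "mail-server@rtfm.mit.edu"
REQUEST_LINE = "send usenet/news.answers/comp-speech-faq/*"


def build_faq_request(sender: str) -> EmailMessage:
    """Build (but do not send) the one-line request message.

    The function name and the Subject header are hypothetical; the FAQ
    only states that the request line goes in the message body.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = MAIL_SERVER
    msg["Subject"] = "comp.speech FAQ request"  # assumed; not required by the FAQ
    msg.set_content(REQUEST_LINE + "\n")
    return msg


if __name__ == "__main__":
    # reader@example.org is a placeholder sender address.
    print(build_faq_request("reader@example.org"))
```

Handing the resulting message to a mail transport (for example smtplib) is left out on purpose, since the availability of the mail server is not something this sketch can assume.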
FAQ Sections

The FAQ is divided into the following sections:
  * FAQ Contents
  * List of Speech Technology Products and Software
  * FAQ Section 1: General Information on Speech Technology
  * FAQ Section 2: Signal Processing
  * FAQ Section 3: Speech Coding and Compression
  * FAQ Section 4: Natural Language Processing
  * FAQ Section 5: Speech Synthesis
  * FAQ Section 6: Speech Recognition

Comp.Speech FTP Site

The comp.speech ftp site (which is described in Q1.2) contains the following:
  * Newsgroup Archives
  * Data Resources
  * General Information
  * Software

Acknowledgements

Hundreds of people have made contributions to the comp.speech FAQ over the last two years; there are too many to name individually. Special thanks go to Tony Robinson and Joe Campbell who have been particularly helpful.

Maintenance

The FAQ posting and the Comp.Speech WWW Site are maintained by

  Andrew Hunt
  Speech Technology Research Group
  Dept. of Electrical Engineering
  University of Sydney, NSW, 2006, Australia
  Ph: 61-2-351 4509
  Fax: 61-2-351 3847
  email: andrewh@speech.su.oz.au

===========================================================================

COMP.SPEECH FAQ CONTENTS

Introduction
  * Overview
  * List of Packages

Section 1 : General Information on Speech Technology
  * Q1.1 What is comp.speech?
  * Q1.2 Where are the comp.speech archives?
  * Q1.3 Common abbreviations and jargon.
  * Q1.4 What are related newsgroups and mailing lists?
  * Q1.5 What are related journals and conferences?
  * Q1.6 What resources are available as handicap aids?
  * Q1.7 What speech data is available?
  * Q1.8 Speech File Formats, Conversion and Playing.
  * Q1.9 What "Speech Laboratory Environments" are available?
  * Q1.10 Miscellaneous Software and Other Resources.

Section 2 : Signal Processing for Speech
  * Q2.1 What sampling do I need for speech?
  * Q2.2 How do I find the pitch of a speech signal?
  * Q2.3 How do I find the start and end points of a speech signal?
  * Q2.4 Where can I find FFT software?
  * Q2.5 What signal processing techniques are used in speech technology?
  * Q2.6 What speech sampling and signal processing hardware can I use?
  * Q2.7 How do I convert to/from mu-law format?

Section 3 : Speech Coding and Compression
  * Q3.1 Speech compression techniques.
  * Q3.2 What are some good references/books on coding/compression?
  * Q3.3 What software is available? (Includes CELP & G.7xx)

Section 4 : Natural Language Processing
  * Q4.1 What are some good references/books on NLP?
  * Q4.2 What NLP software is available?

Section 5 : Speech Synthesis
  * Q5.1 What is speech synthesis?
  * Q5.2 How can speech synthesis be performed?
  * Q5.3 What are some good references/books on synthesis?
  * Q5.4 What software/hardware is available?

Section 6 : Speech Recognition
  * Q6.1 What is speech recognition?
  * Q6.2 How can I build a very simple speech recogniser?
  * Q6.3 What does speaker dependent/adaptive/independent mean?
  * Q6.4 What does small/medium/large/very-large vocabulary mean?
  * Q6.5 What does continuous speech or isolated-word mean?
  * Q6.6 How is speech recognition done?
  * Q6.7 What are some good references/books on recognition?
  * Q6.8 What speech recognition packages are available?

===========================================================================

FAQ: List of Packages

The comp.speech FAQ provides information on a range of software, hardware and resources.

Speech Data
  * Phonemic Samples
  * Linguistic Data Consortium (LDC)
  * Center for Spoken Language Understanding (CSLU)
  * PhonDat - A Large Database of Spoken German
  * Oxford Acoustic Phonetic Database

Speech Processing Environments
  * Entropic Signal Processing System (ESPS) and Waves
  * CSRE: Canadian Speech Research Environment
  * OGI Speech Tools
  * Matlab plus Signal Processing Toolbox
  * Signalyze 3.0 from InfoSignal
  * Kay Elemetrics CSL (Computer Speech Lab) 4300
  * MacSpeech Lab II (MSL II)
  * N!Power
  * Ptolemy
  * Khoros
  * SpeechViewer II

Other Resources
  * CMU Dictionary
  * Another Dictionary
  * BEEP dictionary
  * CUVOLAD dictionary
  * MRC database
  * Network Audio System
  * NEVOT (1.4v) from AT&T BL
  * Human Audio Perception Document
  * Homophone List
  * Auditory Toolbox for Matlab
  * Auditory Modeller 1
  * Auditory Modeller 2

Audio I/O Hardware
  * Sun standard audio port (SPARC I & II)
  * Sun standard audio port (SPARC 10 & 20)
  * Ariel Signal Processors
  * IBM RS/6000 ACPA (Audio Capture and Playback Adapter)
  * Sound Galaxy NX, Aztech Systems
  * Sound Galaxy NX PRO, Aztech Systems
  * ATI Stereo F/X Sound Board
  * Various PC Sound Cards

Compression Software and Hardware
  * File format conversion
  * shorten - a lossless compressor for speech signals
  * 32 kbps ADPCM
  * GSM 06.10 Compression
  * G.721/722/723 Compression
  * G.728 Compression
  * G.728 LD-CELP vocoder
  * U.S.F.S. 1016 CELP vocoder for DSP56001
  * 8 Kbit/s CELP on the TMS320C5x family of DSP chips
  * CELP 3.2a & LPC

Natural Language Processing
  * Natural Language Software Registry (NLSR) - NLP Tools
  * Part of Speech Tagger

Speech Synthesis
  * Orator Text-to-Speech Synthesizer
  * Text to phoneme program (1)
  * Text to phoneme program (2)
  * Text to phoneme program (3)
  * Text to speech program
  * "Speak" - a Text to Speech Program
  * TheBigMouth - a Text to Speech Program
  * TextToSpeech Kit
  * SGI Developers Toolbox Synthesiser
  * rsynth
  * SENSYN speech synthesizer
  * spchsyn.exe
  * CSRE: Canadian Speech Research Environment
  * Eloquence (currently an alpha release)
  * JSRU
  * Klatt-style synthesiser
  * DECTalk
  * Speech Manager and PlainTalk
  * Various Mac Speech Output Applications
  * MacinTalk
  * Monologue by Creative Labs
  * Lernout & Hauspie Text-To-Speech SDK
  * Tinytalk
  * Narrator - narrator.device
  * Infovox Product Range
  * SIMTEL-20

Speech Recognition
  * HM2007 - Speech Recognition Chip
  * Voice Blaster Ver. 4.0
  * Votan
  * Entropic's HTK (HMM Toolkit)
  * DragonDictate version 3.0
  * DragonDictate for Windows
  * DragonVoiceTools
  * IBM Personal Dictation System
  * Osborne Personal Dictation System (in Australia)
  * VoiceServer for Windows
  * IN3 Voice Command for Windows
  * IN3 Voice Command
  * Phonetic Engine 400 (PE400) - Speech Systems, Inc.
  * SayIt
  * Kurzweil Voice for Windows 1.0
  * D6006 Voice Control Processor
  * Speech Commander - Listen for Windows
  * Voice-Trek 2.0
  * Visus SpeechKit
  * recnet
  * Lotec Speech Recognition Package
  * Myers' Hidden Markov Model software
  * Voice Command Line Interface
  * DATAVOX - French
  * PowerSecretary
  * ICSS system from IBM
  * Creative VoiceAssist

===========================================================================

FAQ SECTION 1 - General

Q1.1: WHAT IS COMP.SPEECH?

Comp.speech is a newsgroup for discussion of speech technology and speech science. It covers a wide range of issues from application of speech technology, to research, to products and lots more.
By nature speech technology is an inter-disciplinary field and the newsgroup reflects this. However, computer application is the basic theme of the group.

The following is a list of topics but does not cover all matters related to the field (no order of importance is implied).
  * Speech Recognition - discussion of methodologies, training, techniques, results and applications. This should cover the application of techniques including HMMs, neural-nets and so on to the field.
  * Speech Synthesis - discussion concerning theoretical and practical issues associated with the design of speech synthesis systems.
  * Speech Coding and Compression - both research and application matters.
  * Phonetic/Linguistic Issues - coverage of linguistic and phonetic issues which are relevant to speech technology applications. Could cover parsing, natural language processing, phonology and prosodic work.
  * Speech System Design - issues relating to the application of speech technology to real-world problems. Includes the design of user interfaces, the building of real-time systems and so on.
  * Other matters - relevant conferences, jobs, books, software, hardware, and products.

_________________________________________________________________

Q1.2: WHERE ARE THE COMP.SPEECH ARCHIVES?

comp.speech is being archived for anonymous ftp.
  * ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/archive/

comp.speech/archive contains the articles as they arrive. Batches of 100 articles are grouped into a shar file, along with an associated file of Subject lines.

Other useful information is also available in comp.speech/info.

_________________________________________________________________

Q1.3: COMMON ABBREVIATIONS AND JARGON.
  * ANN - Artificial Neural Network.
  * ASR - Automatic Speech Recognition.
  * ASSP - Acoustics Speech and Signal Processing
  * AVIOS - American Voice I/O Society
  * CELP - Code-book Excited Linear Prediction.
  * COLING - Computational Linguistics
  * DTW - Dynamic Time Warping.
  * FAQ - Frequently Asked Questions.
  * HMM - Hidden Markov Model.
  * IEEE - Institute of Electrical and Electronics Engineers
  * JASA - Journal of the Acoustical Society of America
  * LPC - Linear Predictive Coding.
  * LVQ - Learning Vector Quantisation.
  * NLP - Natural Language Processing.
  * NN - Neural Network.
  * TI - Texas Instruments.
  * TIMIT - A large speech corpus from TI and MIT - see Q1.7
  * TTS - Text-To-Speech (i.e. synthesis).
  * VQ - Vector Quantisation.

_________________________________________________________________

Q1.4: WHAT ARE RELATED NEWSGROUPS AND MAILING LISTS?

Newsgroups

comp.ai - Artificial Intelligence newsgroup.
    Postings on general AI issues, language processing and AI techniques. Has a good FAQ including NLP, NN and other AI information.

comp.ai.nat-lang - Natural Language Processing Group
    Postings regarding Natural Language Processing. Set up to cover a broad range of related issues and different viewpoints.

comp.ai.nlang-know-rep - Natural Language Knowledge Representation
    Moderated group covering Natural Language.

comp.ai.neural-nets - discussion of Neural Networks and related issues.
    There are often postings on speech related matters - phonetic recognition, connectionist grammars and so on.

comp.compression - occasional articles on compression of speech.
    FAQ for comp.compression has some info on audio compression standards.

comp.dcom.telecom - Telecommunications newsgroup.
    Has occasional articles on voice products.

comp.dsp - discussion of signal processing - hardware and algorithms and more.
    Has a good FAQ posting. Has a regular posting of a comprehensive list of Audio File Formats.

comp.multimedia - Multi-Media discussion group.
    Has occasional articles on voice I/O.

sci.lang - Language.
    Discussion about phonetics, phonology, grammar, etymology and lots more.

alt.sci.physics.acoustics
    Some discussion of speech production & perception.

alt.binaries.sounds.misc - posting of various sound samples

alt.binaries.sounds.d - discussion about sound samples, recording and playback.

Mailing Lists

ECTL - Electronic Communal Temporal Lobe
    Founder & Moderator: David Leip. Moderated mailing list for researchers with interests in computer speech interfaces. This list serves a broad community including persons from signal processing, AI, linguistics and human factors. To subscribe, send your name, institute, department, daytime phone and email address to:
    + ectl-request@snowhite.cis.uoguelph.ca
    The ECTL archive site is
    + ftp://snowhite.cis.uoguelph.ca/pub/ectl

Prosody Mailing List
    Unmoderated mailing list for discussion of prosody. The aim is to facilitate the spread of information relating to the research of prosody by creating a network of researchers in the field. If you want to participate, send the following one-line message to
    + listserv@msu.edu
    + subscribe prosody Your Name

foNETiks
    A moderated monthly newsletter distributed by e-mail. It carries job advertisements, notices of conferences, and other news of general interest to phoneticians, speech scientists and others. The editors are Linda Shockey and Gerry Docherty. To subscribe, send the following one-line message to
    + mailbase@mailbase.ac.uk
    + join fonetiks your_first_name your_second_name

Digital Mobile Radio
    Covers lots of areas, including some speech topics such as speech coding and speech compression. Mail Peter Decker (dec@dfv.rwth-aachen.de) to subscribe.

_________________________________________________________________

Q1.5: WHAT ARE RELATED JOURNALS AND CONFERENCES?

Try the following commercially oriented magazines:
  * Voice News - monthly industry newsletter
    Stoneridge Technical Services
    PO Box 1891, Rockville, MD, 20850, USA
    Phone: (301) 424-0114
  * Voice Technology News
  * Voice Processing Magazine (1-800-854-3112)
  * Speech Technology (no longer published)

Try the following technical journals (some contact addresses below):
  * IEEE Transactions on Speech and Audio Processing (from Jan 93)
  * IEEE Signal Processing Magazine (from Jan 93)
  * IEEE Transactions on Acoustics, Speech, and Signal Processing (ASSP) (now obsolete)
  * Computational Linguistics (COLING)
  * Computer Speech and Language
  * Journal of the Acoustical Society of America (JASA)
  * AVIOS Journal
  * ASR News

Try the following conferences:
  * ICASSP      Intl. Conference on Acoustics Speech and Signal Processing (IEEE)
  * ICSLP       Intl. Conference on Spoken Language Processing
  * EUROSPEECH  European Conference on Speech Communication and Technology
  * AVIOS       American Voice I/O Society Conference
  * SST         Australian Speech Science and Technology Conference

Here are a few contact addresses:

Publications: IEEE Transactions on Speech and Audio Processing (from Jan 93)
              IEEE Transactions on Acoustics, Speech, and Signal Processing (ASSP) - now obsolete.
Organization: Institute of Electrical and Electronics Engineers (IEEE)
Contact:      IEEE Service Center
              445 Hoes Lane, PO Box 1331, Piscataway, NJ 08855, USA
              Phone: 1-800-678-IEEE or (201)981-0060

Publications: Computer Speech and Language
Contact:      Academic Press, Ltd.
              24-28 Oval Rd, London NW1, England
Price:        $136 (Institutions), $58 (Individuals)

Publications: Computational Linguistics
Organization: Association for Computational Linguistics
Contact:      MIT Press Journals
              55 Hayward St, Cambridge, MA 02142, USA
              Phone: (617)253-2889

_________________________________________________________________

Q1.6: WHAT RESOURCES ARE AVAILABLE AS HANDICAP AIDS?

Can anyone provide information on speech technology aids for the deaf, blind, speech impaired, physically impaired and other groups who may benefit from speech technology?

SpeechViewer II
  * Platform: IBM Machines from Mod 25 on.
  * Description: SpeechViewer II is a speech therapy tool. It provides graphical feedback of various speech features so that speech impaired individuals can improve their speech. It works with an audio bandwidth of 7.3 kHz and thus allows the therapist to work with sustained vowels and fricatives. A wide range of graphics is used to provide adequate variability to hold client interest. An extensive set of statistics is gathered, which allows a therapist to do research or keep therapy records. The speech therapy modules are:
    + Awareness - Sound, Loudness, Pitch, Voicing Onset, Voicing
    + Skill Building - Pitch, Voicing, Phonology
    + Patterning - Pitch & Loudness - Waveform & Spectrogram, Spectra
    + Clinical Management - Profiles, Models, Client Data
  * Hardware: Requires an IBM M-ACPA (Multimedia-Audio Capture Playback Adapter). It has a TI TMS320C25 DSP chip. The input sampling rate is 44.1 kHz stereo, 88.2 kHz mono. This is a 16 bit card. It has the following jacks: mic in, stereo line in, stereo line out, speaker out. Note: This card is being replaced by Mwave technology. For more info on Mwave contact Texas Instruments.
  * Price:
    + The software is $2130 list, $1491 educational, part number 92F2066.
    + The M-ACPA is $370 list, $222 educational, part number 92F3378.
    + The MicroChannel adapter part number is 92F3379 (same price).
  * Contact: The Psychological Corporation (TPC) [IBM Authorized Remarketer] Phone: 1-800-228-0752 or contact IBM on 1-800-426-4832.

_________________________________________________________________

Q1.7: WHAT SPEECH DATA IS AVAILABLE?

A wide range of speech databases have been collected. These databases are primarily for the development of speech synthesis/recognition and for linguistic research.
Some databases are free but most appear to be available for a small cost. The databases normally require lots of storage space - do not expect to be able to ftp all the data you want.

Phonemic Samples
  * First, some basic data. The following ftp sites have samples of English phonemes (American accent I believe) in Sun audio format files. See Question 1.8 for information on audio file formats.
    + ftp://sounds.sdsu.edu/.1/phonemes : This ftp site appears to be obsolete. Does anyone know a new address?
    + ftp://phloem.uoregon.edu/pub/Sun4/lib/phonemes : There appears to be some config problem with this ftp server.
    + ftp://sunsite.unc.edu/pub/multimedia/sun-sounds/phonemes

Linguistic Data Consortium (LDC)
  * Briefly stated, the LDC has been established to broaden the collection and distribution of speech and natural language data bases for the purposes of research and technology development in automatic speech recognition, natural language processing and other areas where large amounts of linguistic data are needed.
    Here is a list of some of the corpora:
    + The TIMIT and NTIMIT speech corpora
    + The Resource Management speech corpus (RM1, RM2)
    + The Air Travel Information System (ATIS0) speech corpus
    + The Association for Computational Linguistics - Data Collection Initiative text corpus (ACL-DCI)
    + The TI Connected Digits speech corpus (TIDIGITS)
    + The TI 46-word Isolated Word speech corpus (TI-46)
    + The Road Rally conversational speech corpora (including "Stonehenge" and "Waterloo" corpora)
    + The Tipster Information Retrieval Test Collection
    + The Switchboard speech corpus ("Credit Card" excerpts and portions of the complete Switchboard collection)
  * Further resources made available in the first year (or two):
    + The Machine-Readable Spoken English speech corpus (MARSEC)
    + The Edinburgh Map Task speech corpus
    + The Message Understanding Conference (MUC) text corpus of FBI terrorist reports
    + The Continuous Speech Recognition - Wall Street Journal speech corpus (WSJ-CSR)
    + The Penn Treebank parsed/tagged text corpus
    + The Multi-site ATIS speech corpus (ATIS2)
    + The Air Traffic Control (ATC) speech corpus
    + The Hansard English/French parallel text corpus
    + The European Corpus Initiative multi-language text corpus (ECI)
    + The Int'l Labor Organization/Int'l Trade Union multi-language text corpus (ILO/ITU)
    + Machine-readable dictionaries/lexical data bases (COMLEX, CELEX)
  * Detailed information about the Linguistic Data Consortium is available by anonymous ftp from the address below. The files in the directory include more detailed information on the individual databases.
    + ftp://ftp.cis.upenn.edu/pub/ldc
  * For further information contact
      Linguistic Data Consortium
      441 Williams Hall, University of Pennsylvania
      Philadelphia, PA 19104-6305
      Phone: +1 (215) 898-0464
      Fax: +1 (215) 573-2175
      e-mail: ldc@unagi.cis.upenn.edu

Center for Spoken Language Understanding (CSLU)
  * The ISOLET speech database of spoken letters of the English alphabet.
    The speech is high quality (16 kHz with a noise cancelling microphone). 150 speakers x 26 letters of the English alphabet twice in random order. The ISOLET data base can be purchased for $100 by sending an email request to vincew@cse.ogi.edu. (This covers handling, shipping and medium costs). The data base comes with a technical report describing the data.
  * CSLU has a telephone speech corpus of 1000 English alphabets. Callers recite the alphabet with brief pauses between letters. This database is available to not-for-profit institutions for $100. The data base is described in the proceedings of the International Conference on Spoken Language Processing.
    + Contact vincew@cse.ogi.edu if interested.
  * CSLU has released for universities its Continuous English Speech Corpus. The corpus contains recorded speech from 690 different speakers, with label files at various levels - including word level and phonetic labels. The data were collected as part of the OGI Multi-language telephone corpus. CSLU provides speech corpora to all universities without charge. To order a corpus, print the license agreement/order form, complete it, and fax it to the CSLU. A description of the corpora and an order form are available by anonymous ftp:
    + ftp://speech.cse.ogi.edu/pub/releases
  * Contact: Mike Noel - email: noel@cse.ogi.edu Phone: (503) 690-1309

PhonDat - A Large Database of Spoken German
  * The PhonDat continuous speech corpora are now available on CD-ROM media (ISO 9660 format).
    + PhonDat I  (Diphone Corpus)         : 6 CDs (1140.- DM)
    + PhonDat II (Train Enquiries Corpus) : 1 CD  ( 190.- DM)
  * PhonDat I comprises approx. 20,000 and PhonDat II approx. 1500 signal files in high quality 16-bit 16 kHz recording. The corpora come with documentation containing the orthographic transcription and a citation form of the utterances, as well as a detailed file format description. A narrow phonetic transcription is available for selected files from corpus I and II.
  * For information and orders contact
      Barbara Eisen
      Institut fuer Phonetik
      Schellingstr. 3 / II
      D 80799 Munich 40
      Tel: +49 / 89 / 2180 -2454 or -2758
      Fax: +49 / 89 / 280 03 62

Oxford Acoustic Phonetic Database
  * Available on compact disc, from J. Pickering and B. Rosner. It contains data on vowel-consonant and consonant-vowel combinations in both stressed and unstressed locations. The languages covered include French, German, Hungarian, Italian, Japanese, British English, Spanish and English. For further information write to Electronic Publishing, Oxford University Press, Walton Street, Oxford OX2 6DP, UK. The ISBN is 0-19-268086-2
  * Contact: Prof. B. Rosner
      Dept. of Experimental Psychology
      South Parks Rd, Oxford, OX1 3UD, UK
      email: burton.rosner@wolfson.ox.ac.uk

_________________________________________________________________

Q1.8: SPEECH FILE FORMATS, CONVERSION AND PLAYING.

Section 2 of this FAQ has information on mu-law coding.

A very good and very comprehensive list of audio file formats is prepared by Guido van Rossum. The list is posted regularly to comp.dsp and alt.binaries.sounds.misc, amongst others. It includes information on sampling rates, hardware, compression techniques, file format definitions, format conversion, standards, programming hints and lots more.

It is also available by ftp from
  * ftp://ftp.cwi.nl/pub/audio/AudioFormats.part1,2

_________________________________________________________________

Q1.9: WHAT "SPEECH LABORATORY ENVIRONMENTS" ARE AVAILABLE?

First, what is a Speech Laboratory Environment? A speech lab is a software package which provides the capability of recording, playing, analysing, processing, displaying and storing speech. Your computer will require audio input/output capability. The different packages vary greatly in features and capability - best to know what you want before you start looking around.
Most general purpose audio processing packages will be able to process speech but do not necessarily have some specialised capabilities for speech (e.g. formant analysis).

The following article provides a good survey.
  * Read, C., Buder, E., & Kent, R. "Speech Analysis Systems: An Evaluation" Journal of Speech and Hearing Research, pp 314-332, April 1992.

Entropic Signal Processing System (ESPS) and Waves
  * Platform: Range of Unix platforms.
  * Description: ESPS is a comprehensive set of speech analysis/processing tools for the UNIX environment. The package includes UNIX commands, and a comprehensive C library (which can be accessed from other languages). Waves is a graphical front-end for speech processing. Speech waveforms, spectrograms, pitch traces etc can be displayed, edited and processed in X windows and Openwindows (versions 2 & 3). Waves also includes a signal labelling utility which provides multiple feature labelling and useful features for fast labelling of large speech databases. Entropic also distributes HTK (the Hidden Markov Model Toolkit). HTK is described in Section 6 of this FAQ.
  * Cost: On request.
  * Contact: Entropic Research Laboratory, Washington Research Laboratory
      600 Pennsylvania Ave, S.E. Suite 202, Washington, D.C. 20003
      (202) 547-1420
      email - info@entropic.com

CSRE: Canadian Speech Research Environment
  * Platform: IBM/AT-compatibles
  * Description: CSRE is a microcomputer-based system designed to support speech research. CSRE provides a low-cost facility in support of speech research, using mass-produced and widely-available hardware. The project is non-profit, and relies on the cooperation of researchers at a number of institutions and fees generated when the software is distributed.
    Functions include speech capture, editing, and replay; several alternative spectral analysis procedures, with color and surface/3D displays; parameter extraction/tracking and tools to automate measurement and support data logging; alternative pitch-extraction systems; parametric speech (KLATT80) and non-speech acoustic synthesis, with a variety of supporting productivity tools; and an experiment generator, to support behavioral testing using a variety of common testing protocols. A paper about the whole package can be found in:
    + Jamieson D.G. et al, "CSRE: A Speech Research Environment", Proc. of the Second Intl. Conf. on Spoken Language Processing, Edmonton: University of Alberta, pp. 1127-1130.
  * Hardware: Can use a range of data acquisition/DSP hardware
  * Cost: Distributed on a cost recovery basis.
  * Availability: For more information on availability contact Krystyna Marciniak email march@uwovax.uwo.ca Tel (519) 661-3901 Fax (519) 661-3805. For technical information email ramji@uwovax.uwo.ca
  * Note: Also included in Q5.4 on speech synthesis packages.

OGI Speech Tools
  * Developers from the Center for Spoken Language Understanding (CSLU) at the Oregon Graduate Institute of Science and Technology (Portland Oregon)
  * Platform: Unix
  * Description: The OGI Speech tools include:
    + An X windows display tool (LYRE) for displaying data in a time synchronous fashion for
      a. the speech signal
      b. spectrograms
      c. phoneme labels, and other information.
    + A Neural Network (NOPT) training package.
    + A set of C library routines (LIBNSPEECH) for the manipulation of speech data, including:
      a. PLP Analysis,
      b. Rasta PLP Analysis,
      c. Linear Predictive Coding,
      d. Mel Cepstrum Coding,
      e. Fast Fourier Transform
    + A set of utilities for converting file formats such as ADC, NIST, mu-law, binary files, and ascii. Includes filtering.
    + A database utility (find_phone) to automate speech database related enquiries.
      It allows the user to specify a particular label or set of labels in a given context, display all occurrences of the label, and relabel the occurrences if desired.
    + A Vector-Quantizer based on the Linde Buzo and Gray (LBG) algorithm.
    + A set of PERL Scripts which have been used mainly to automate the use of the OGI Speech Tools.
    + MAN Pages for all routines and programs developed, as well as a User manual in both PostScript and TeX format.
  * Misc: Software is written in ANSI C.
  * Availability: By anonymous ftp from
    + ftp://speech.cse.ogi.edu/pub/tools/
  * Contact: Try tools@cse.ogi.edu

Matlab plus Signal Processing Toolbox
  * Platform: Wide range
  * Description: Matlab (MATrix LABoratory) is a technical computing environment for numerical computation and visualization based on a matrix oriented, interpreted programming language. The programming environment provides support for the development of customized operations, along with debugging facilities and a graphical user interface toolkit. Audio output is provided. A specialised Signal Processing Toolbox is available which provides many functions which are useful for speech analysis. It includes filter design, spectral estimation, statistical signal processing, waveform generation, and signal and spectrogram display. A specialised Auditory Toolbox is available which contains functions useful to people interested in auditory/cochlear models. A more detailed description is given in Q1.10.
  * Price: On request.
  * Contact: The Math Works Inc.
      24 Prime Park Way, Natick, MA 01760-1500 USA
      Ph: 1-508-653 1415
      Fax: 1-508-653 6284
      Email: info@mathworks.com
  * FTP: ftp://ftp.mathworks.com
  * WWW: http://www.mathworks.com/

Signalyze 3.0 from InfoSignal
  * Platform: Macintosh
  * Description: Signalyze's basic conception revolves around up to 100 signals, displayed synchronously in HyperCard fashion on "cards".
    The program offers a complement of signal editing features, quite a few spectral analysis tools, manual scoring tools, pitch extraction routines, a good set of signal manipulation tools, and extensive input-output capacity. Handles multiple file formats: Signalyze, MacSpeech Lab, AudioMedia, SoundDesigner II, SoundEdit/MacRecorder, SoundWave, three sound resource formats, and ASCII-text. Sound I/O: Direct sound input from MacRecorder and similar devices, AudioMedia, AudioMedia II and AD IN, some MacADIOS boards and devices, Apple sound input (built-in microphone). Sound output via Macintosh internal sound, via SoundManager 3.0, some MacADIOS boards and devices as well as via the Digidesign 16-bit boards. It has a range of capabilities for creating, editing and manipulating label files with flexibility in labelling format.
  * Compatibility: MacPlus and higher (including II, IIx, IIcx, IIci, IIfx, IIvx, IIvi, Portable, all PowerBooks, Centris and Quadras). Takes advantage of large and multiple screens and 16/256 color/grayscales. System 7.0 compatible. Runs in background with adjustable priority.
  * Misc: A demo available upon request. Manuals and tutorial included. It is available in English, French, and German. An UPDATER to version 2.48 is now available in:
    + The UNIL Gopher server (see last page of InfoSignal News 8)
    + The LAIP FTP server. Address: MACFL4082.unil.ch, machine no. 130.223.104.31
    Also available are a demo program, and current questions and answers.
  * Cost: Individual licence US$350, site license US$500, plus shipping. Upgrades from version 2.0 are available.
  * Contact: North America - Network Technology Corporation
      91 Baldwin St., Charlestown MA 02129
      Fax: 617-241-5064 Phone: 617-241-9205
    Elsewhere contact InfoSignal Inc.
      C.P. 73, 1015 LAUSANNE, Switzerland
      FAX: +41 21 691-1372
      Email: 76357.1213@COMPUSERVE.COM
Kay Elemetrics CSL (Computer Speech Lab) 4300 * Platform: Minimum IBM PC-AT compatible with extended memory (min 2MB) with at least VGA graphics. Optimal would be a 386 or 486 machine with more RAM for handling larger amounts of data. * Description: Speech analysis package, with optional separate LPC program for analysis/synthesis. Uses its own file format for data, but has some ability to export data as ASCII. The main editing/analysis program (but not the LPC part) has its own macro language, making it easy to perform repetitive tasks. Probably not much use without the extra LPC program, which also allows manipulation of pitch, formant and bandwidth parameters. Hardware includes an internal DSP board for the PC (requires ISA slot), and an external module containing signal processing chips which does A/D and D/A conversion. * Misc: A programmer's kit is available for programming the signal processing chips (experts only). A speaker and microphone are supplied. Manuals are included. * Cost: Recently approx 6000 pounds sterling. * Contact: UK distributors are Wessex Electronics, 114-116 North Street, Downend, Bristol, B16 5SE Tel: 0272 571404. In the USA contact: Kay Elemetrics Corp, 12 Maple Avenue, PO Box 2025, Pine Brook, NJ 07058-9798 Tel:(201) 227-7760 MacSpeech Lab II (MSL II) * Platform: Macintosh * Description: A sound analysis and acquisition package for Macs. MSL II delivers the most common functions for speech analysis (FFTs, LPCs, f0 extraction, etc.) & produces grayscale spectrographic displays. Can be used for various speech technology and phonetic training tasks. The software can trade off accuracy and speed. * Hardware: Requires MacADIOS ("Macintosh Analog/Digital Input/Output System") hardware for speech I/O at 12/16 bits. * Misc: Software no longer updated by GW Instruments; MSL soft/hardware will not perform input/output on Quadras, for example, though analysis seems fine. Known to operate properly on systems as high as IIcx & IIfx.
* Cost: $4990 (in May '92 price list; no MSL soft/hardware package listed in January '93). * Contact: GW Instruments 35 Medford Street, Somerville, MA 02143 Phone: (617) 625-4096 Fax: (617) 625-1322 N!Power * Platform: SUN, DEC and HP workstations. * Description: An object-oriented software package with a MOTIF GUI interface and a range of functionality for data analysis/editing, signal analysis, speech processing, real-time A/D and D/A, and 2D/3D interactive graphics. N!Power replaces ILS. N!Power can provide a Block Diagram user interface, menus, pop-ups, and a high-level IEEE standard symbolic scripting language. You can customize the blocks, menus and pop-ups with mouse point-and-click operations. * Contact: Signal Technology, Inc. 104 W. Anapamu, Suite J, Santa Barbara, CA 93101-3126 Phone: 805-899-8300 FAX: 805-899-4344 email: larry@signal.com Ptolemy * Platform: Sun SPARC, DecStation (MIPS), HP (hppa). * Description: Ptolemy provides a highly flexible foundation for the specification, simulation, and rapid prototyping of systems. It is an object oriented framework within which diverse models of computation can co-exist and interact. Ptolemy can be used to model entire systems. Ptolemy has been used for a broad range of applications including signal processing, telecommunications, parallel processing, wireless communications, network design, radio astronomy, real time systems, and hardware/software co-design. Ptolemy has also been used as a lab for signal processing and communications courses. Ptolemy has been developed at UC Berkeley over the past 3 years. Further information, including papers and the complete release notes, is available from the FTP site. * Cost: Free * Availability: The source code, binaries, and documentation are available by anonymous ftp from + ftp://ptolemy.berkeley.edu/pub/README Khoros * Description: Public domain image processing package with a basic DSP library. Not particularly applicable to speech, but not bad for the price.
* Cost: Free * Availability: By anonymous ftp from ftp://pprg.eece.unm.edu SpeechViewer II * Description: Speech Therapy Tool. See the detailed description in the handicap section - Q1.6. _________________________________________________________________ Q1.10: MISCELLANEOUS SOFTWARE AND OTHER RESOURCES. CMU dictionary * Description: Phonemic transcriptions of 100,000 words with American English pronunciation. * Availability: By anonymous ftp from the directory + ftp://ftp.cs.cmu.edu/project/fgdata/dict with the files README, cmudict.0.2.Z, cmulex.0.1.Z, phoneset.0.1 Dictionary * Description: A comprehensive word list which should contain most common American words, abbreviations, hyphenations, and even incorrect spellings. The word lists were compiled from a number of sources: commercial news services, UseNet news postings, existing dictionaries, name lists, company lists, UNIX man pages, Project Gutenberg's E-texts, project Wordnet, received mailings, etc. The current size is 460,000 words. * Availability: By anonymous ftp from + ftp://wocket.vantage.gte.com:/pub/standard_dictionary Note 1: There seems to be some sort of network problem reaching the server. Note 2: There is a README file which explains the file formats. BEEP dictionary * Description: Phonemic transcriptions of 100,000 English words. (British English pronunciations) * Availability: By anonymous ftp from the file + svr-ftp.eng.cam.ac.uk/comp.speech/data/beep-0.3.tar.Z CUVOLAD dictionary * Description: Computer Usable Version of the Oxford Advanced Learner's Dictionary. Has British English pronunciations and parts of speech. * Availability: By anonymous ftp from the directory + ftp://black.ox.ac.uk/ota/dicts/710 MRC database * Description: The Medical Research Council Psycholinguistic Database. Has British English pronunciations, parts of speech, word frequency and lots of other information.
* Availability: By anonymous ftp from the directory + ftp://black.ox.ac.uk/ota/dicts/1054 Network Audio System Release 1.1 * Platforms: Various (includes SunOS, Solaris, SGI) * Description: A device-independent mechanism for transferring, playing and recording audio signals over a network. Has a range of features suited to networks. * Cost: Free * Availability: By anonymous ftp from + ftp://ftp.x.org:/contrib/audio/nas/netaudio-1.2.tar.gz Also available in the same directory are document files and some sample sounds. AF version AF3R1 * Platforms: DEC workstations (Alpha and MIPS), SparcStation, SGI * Description: The AF System is a device-independent network-transparent system including client applications and audio servers. With AF, multiple audio applications can run simultaneously, sharing access to the actual audio hardware. The AF3R1 distribution of AF includes server support for Digital RISC systems running Ultrix, Digital Alpha AXP systems running OSF/1, SGI Indigo running IRIX 4.0.5, Sun Microsystems SPARCstations running SunOS 4.1.3, and Sun Microsystems SPARCstations running Solaris 2.3. The servers support audio hardware ranging from the built-in CODEC audio on SPARCstations and Personal DECstations to 48 kHz stereo audio using the DECaudio TURBOchannel module or the SPARCstation DBRI interface. * Availability: The source kit is distributed by anonymous ftp from + ftp://crl.dec.com/pub/DEC/AF * Contact: af-request@crl.dec.com + http://www.research.digital.com/CRL/projects/AF/home.html NEVOT (1.4v) from AT&T BL * Platforms: Sun Sparc Station (SunOS 4.1.x) and Silicon Graphics * Description: Audio-conferencing tool which supports both point-to-point and broadcasting of audio using multicast IP. Audio encoding: + PCM 64 kb/s 8-bit u-law encoded 8 kHz PCM (G.711) + ADPCM 32 kb/s [Sun only] (G.721) + DVI ADPCM 32 kb/s + ADPCM 24 kb/s [Sun only] (G.723) + CELP 4.8 kb/s + LPC 2.4 kb/s Source is available.
* Availability: by anonymous ftp from + ftp://gaia.cs.umass.edu/pub/hgschulz/nevot * Contact: Henning Schulzrinne (hgs@researh.att.com) Human Audio Perception Document * Description: Document prepared by Argiris Kranidiotis on the human audio perception system. It lists a number of references, gives plenty of numbers and some equations. * Availability: by anonymous ftp from the comp.speech archive site + ftp://svr-ftp.eng.cam.ac.uk/comp.speech/info/HumanAudioPerception * Contact: Argiris A. Kranidiotis University Of Athens, Informatics Department email: akra@zeus.di.uoa.ariadne-t.gr Homophone List * A list of homophones in General American English is available by anonymous FTP from the comp.speech archive site: + ftp://svr-ftp.eng.cam.ac.uk/comp.speech/data/homophones-1.01.txt Auditory Toolbox for Matlab * Description: This toolbox provides extensions to Matlab which are useful to people interested in auditory/cochlear modeling. [Matlab is described in the previous section.] This toolbox has been tested on both Macintosh and Unix computers. It includes the following major models: + Lyon's Passive Long Wave Cochlear Model (our conventional model) + Patterson-Holdsworth ERB Filter bank with Meddis Hair cell + Seneff's Auditory Model (Stages I and II) + MFCC (Mel-scale frequency cepstral coefficients from the ASR world) + Spectrogram + Correlogram generation and pitch modeling + Simple vowel synthesis * Availability: By anonymous FTP from the following site: + ftp://ftp.apple.com/pub/malcolm The following files are available: + 419487 AuditoryToolbox.mif.Z + 1372976 AuditoryToolbox.psc.Z + 573215 AuditoryToolbox.sea.hqx + 92160 AuditoryToolbox.tar + 36405 AuditoryToolbox.tar.Z The ".mif.Z" file is a Unix compressed version of the FrameMaker documentation. The ".psc.Z" file is a Unix compressed version of the Postscript documentation. The ".tar" and ".tar.Z" files are Unix TAR archives containing all of the m-functions and C-MEX source code.
Finally, the ".sea.hqx" file is a Macintosh self-extracting archive that has been encoded using BinHex. We do provide precompiled versions of the three MEX functions for the Macintosh. * Misc: Our lawyers ask us to remind you that there is no warranty. We've done some testing but we undoubtedly missed things. * Contact: Malcolm Slaney: Interval Research. Email: malcolm@interval.com Auditory Modeller 1 * Description: John Holdsworth's implementation of a gammatone filter bank and Roy Patterson's spiral model, in C (with X-window display). * Availability: By anonymous ftp from + ftp://ftp.mrc-apu.cam.ac.uk/pub/aim Auditory Modeller 2 * Description: Lowel O'Mard's implementation of peripheral filtering, Ray Meddis's hair cell model and other stuff in C (as a library of routines). * Availability: By anonymous ftp from + ftp://suna.lut.ac.uk/public/hulpo/lutear _________________________________________________________________ Andrew Hunt --- Speech Technology Research Group Ph: 61-2-351 4509 Dept. of Electrical Engineering Fax: 61-2-351 3847 University of Sydney, NSW, 2006, Australia email: andrewh@speech.su.oz.au Archive-name: comp-speech-faq/part2 Last-modified: 1995/01/19 COMP.SPEECH FAQ POSTING - PART 2/3 [Note: this document has been automatically extracted from a WWW site: http://www.speech.su.oz.au/comp.speech This may introduce some formatting errors.] =========================================================================== FAQ SECTION 2 - Signal Processing for Speech Q2.1: WHAT SAMPLING DO I NEED FOR SPEECH? For recorded speech to be understood by humans you need an 8kHz sampling rate or more and at least 8 bit sampling. This produces poor quality speech - but it can be understood. Improvements can be achieved by increasing the number of bits in sampling to 12 bits or 16 bits, or by using a non-linear encoding technique such as mu-law or A-law (see Q2.7). This improves the "signal-to-noise" ratio.
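The trade-off between sample size, data rate and signal-to-noise ratio can be put into rough numbers. The sketch below is only an illustration: the function names are made up for this example, and the SNR figure uses the usual textbook approximation for an ideal linear (uniform) quantiser driven by a full-scale sine wave. It does not describe mu-law or A-law coding, and real speech will normally score lower.

```c
#include <assert.h>

/* Raw (uncompressed) data rate of one channel, in bits per second.
 * Hypothetical helper for illustration only. */
long pcm_bit_rate(long sample_rate_hz, int bits_per_sample)
{
    assert(sample_rate_hz > 0 && bits_per_sample > 0);
    return sample_rate_hz * bits_per_sample;
}

/* Best-case signal-to-quantisation-noise ratio, in dB, of linear PCM.
 * Assumes an ideal uniform quantiser and a full-scale sine input,
 * which gives the textbook rule SNR ~= 6.02 * bits + 1.76. */
double linear_pcm_snr_db(int bits_per_sample)
{
    assert(bits_per_sample > 0);
    return 6.02 * bits_per_sample + 1.76;
}
```

Under these assumptions, pcm_bit_rate(8000, 8) gives 64000 bits per second, and linear_pcm_snr_db() gives roughly 49.9 dB for 8 bits, 74.0 dB for 12 bits and 98.1 dB for 16 bits, i.e. about 6 dB for every extra bit.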
Increasing the sampling rate above 8kHz, say to 10kHz, 16kHz or 20kHz, improves the frequency response: the higher the sampling frequency the better the high frequency content will be. A 16kHz sampling rate is a reasonable target for high quality speech recording and playback. When doing speech recognition you need to remember that your computer is not as good as your ear so it will have trouble with poor quality sounds. The choice of an appropriate sampling setup depends very much on the speech recognition task and the amount of computer power available. _________________________________________________________________ Q2.2: HOW DO I FIND THE PITCH OF A SPEECH SIGNAL? This topic comes up regularly in the comp.dsp newsgroup. Question 2.5 of the FAQ posting for comp.dsp gives a comprehensive list of references on the definition, perception and processing of pitch. _________________________________________________________________ Q2.3: HOW DO I FIND THE START AND END POINTS OF A SPEECH SIGNAL? A large number of papers have been presented on this task. Try the following papers: * Rabiner LR, Sambur MR, "An Algorithm for Determining the Endpoints of Isolated Utterances", Bell System Technical Journal, Vol 54, No. 2, pp 297-315, 1975. * Drago, P.G. et al. "Digital Dynamic Speech Detectors." IEEE Trans on Communications, Vol 26, No 1, Jan 78, pp. 140-145. * Newman, W.C. "Detecting Speech with an Adaptive Neural Network." Electronic Design. 22 March 1990. * Taboada, J. et al. "Explicit Estimation of Speech Boundaries" IEE Proc. Sci. Meas. Technol., Vol 141, No.3, May 1994 pp153-159. _________________________________________________________________ Q2.4: WHERE CAN I FIND FFT SOFTWARE? Try the following file available by anonymous ftp. It contains a series of optimised fft routines, including mixed-radix algorithms. The .gz suffix indicates GNU zip format.
* ftp://usc.edu/pub/C-numanal/fft-stuff.tar.gz _________________________________________________________________ Q2.5: WHAT SIGNAL PROCESSING TECHNIQUES ARE USED IN SPEECH TECHNOLOGY? This question is far too big to be answered in a FAQ posting. Fortunately there are many good books which answer the question. Some good introductory books include * Digital processing of speech signals; L. R. Rabiner, R. W. Schafer. Englewood Cliffs; London: Prentice-Hall, 1978 * Voice and Speech Processing; T. W. Parsons. New York; McGraw Hill 1986 * Computer Speech Processing; ed Frank Fallside, William A. Woods Englewood Cliffs: Prentice-Hall, c1985 * Digital speech processing : speech coding, synthesis, and recognition edited by A. Nejat Ince; Kluwer Academic Publishers, Boston, c1992 * Speech science and technology; edited by Shuzo Saito pub. Ohmsha, Tokyo, c1992 * Speech analysis; edited by Ronald W. Schafer, John D. Markel New York, IEEE Press, c1979 * Douglas O'Shaughnessy -- Speech Communication: Human and Machine Addison Wesley series in Electrical Engineering: Digital Signal Processing, 1987. * Discrete-time processing of speech signals; John R Deller, John G Proakis, John H L Hansen; Macmillan 1993. * Signal processing of speech; F J Owens; Macmillan 1993. _________________________________________________________________ Q2.6: WHAT SPEECH SAMPLING AND SIGNAL PROCESSING HARDWARE CAN I USE? In addition to the following information, have a look at the Audio File format document prepared by Guido van Rossum (see details in Section 1.8). Can anyone provide information on Mac, SGI, NeXT and other hardware? Sun standard audio port: SPARC I & II * Input and Output: 1 channel, 8 bit mu-law encoded, 8kHz sample rate. This provides telephone quality sampling. Sun standard audio port (SPARC 10 & 20) * Input and Output: Stereo (2 channels). 16-bit linear sampling.
Multiple sample rates (48000, 44100, 37800, 32000, 22050, 18900, 16000, 11025, 9600, 8000 Hz) Macintosh Audio Hardware - an overview * Description: ALL Macintosh computers come with the ability to play back sounds at any sample rate (sample rate conversion is done in software.) Older machines have 8 bit stereo output (hardware runs at 22254 samples/second). The newer machines have 16 bit stereo hardware running at 44100 samples/second. Most of the recent Macintosh computers come with sound input hardware. There are probably exceptions to this, but the older and some of the current low-end machines have 8 bit (linear) mono hardware running at 22254.54 samples/second. All of the PowerPC, AV, and the 500 series notebook computers come with 16 bit 44kHz stereo sampling hardware. They can also record at 22050 samples/second. The sound manager implements an AGC (Automatic Gain Control) function for the 8 bit hardware. The drivers have a switch to turn off the AGC. There are a number of DSP vendors that support high quality audio. Generally this means quieter analog sections, and more IO formats (AES/EBU, for example). Try DigiDesign and Spectral Innovations. The software drivers for sound are described in "Inside Macintosh: Sound". If you want to see some sample code check out the sources for the Matlab "Sound and Image Toolbox". They can be found at + ftp://ftp.apple.com/pub/malcolm/SoundAndImageToolbox.cpt.hqx Routines that play and record sounds using the toolbox are included (and interfaced to Matlab). Ariel Signal Processors * Platform: Various * Description: A range of signal I/O, A/D, D/A and DSP products are available. There are too many to list. * Contact: Ariel Corp. 433 River Road, Highland Park, NJ 08904. Ph: 908-249-2900 Fax: 908-249-2123 DSP BBS: 908-249-2124 IBM RS/6000 ACPA (Audio Capture and Playback Adapter) * Description: The card supports PCM, Mu-Law, A-Law and ADPCM at 44.1kHz (& 22.05, 11.025, 8kHz) with 16-bits of resolution in stereo.
The card has a built-in DSP (don't know which one). The device also supports various formats for the output data, like big-endian, twos complement, etc. Good noise immunity. The card is used for IBM's VoiceServer (they use the DSP for speech recognition). Apparently, the IBM VoiceServer has a speaker-independent vocabulary of over 20,000 words and each ACPA can support two independent sessions at once. * Cost: $US495 * Contact: ? Sound Galaxy NX, Aztech Systems * Platform: PC - DOS, Windows 3.1 * Cost: ? * Input: 8bit linear, 4-22 kHz. * Output: 8bit linear, 4-44.1 kHz * Misc: 11-voice FM Music Synthesizer YM3812; Built-in power amplifier; DSP signal processing support - ST70019SB Hardware ADPCM decompression (2:1,3:1,4:1) "AdLib" and "Sound Blaster" compatibility. Software includes a simple Text-to-Speech program "Monologue". Sound Galaxy NX PRO, Aztech Systems * Platform: PC - DOS, Windows 3.1 * Cost: ? * Input: 2 * 8bit linear, 4-22.05 kHz (stereo), 4-44.1 kHz (mono). * Output: 2 * 8bit linear, 4-44.1 kHz (stereo/mono) * Misc: 20-voice FM Music Synthesizer; Built-in power amplifier; Stereo Digital/Analog Mixer; Configuration in EEPROM. Hardware ADPCM decompression (2:1,3:1,4:1). Includes DSP signal processing support. "AdLib" and "Sound Blaster Pro II" compatibility. Software includes a simple Text-to-Speech program "Monologue" and Sampling laboratory for Windows 3.1: WinDAT. * Contact: USA (510)6238988 ATI Stereo F/X Sound Board * Platform: PC XT or AT - DOS, Windows 3.0, 3.1 * Cost: $120 Canadian * Description: Input - 8 bit ADC, 44.1 kHz mono, 22.05 kHz Stereo. Output - Dynamic range = 48 dB, 32 anti-aliasing filters. Adds Stereo effect to existing mono Adlib or Sound Blaster apps. 11-voice YAMAHA FM Music Synthesizer. Built-in 8 watt power amplifier, 4 watts per channel. Volume ctrl on rear. 2 Joystick input, software setup (no switches), software included. "AdLib" and "Sound Blaster" compatibility. DMA support for high speed digital audio.
ADPCM decomp @ 4:1, 3:1, 2:1. Will play .WAV files. Optional MIDI I/O port $79. (MIDI IN, OUT, THRU, and sequencer). * Contact: ATI Technologies Inc. 3761 Victoria Park Avenue, Scarborough, Ontario CANADA, M1W 3S2 Ph: (416) 756-0711 Fax: (416) 756-0720 BBS: (416) 764-9404 (9600 baud N.8.1)

Other PC Sound Cards

==================================================================================
sound          stereo/mono               compatible     included        voices
card           & sample rate             with           ports
==================================================================================
Adlib Gold     stereo:  8-bit 44.1kHz    Adlib ?        audio in/out,   20 (opl3)
1000                   16-bit 44.1kHz                   mic in,         +2 digital
               mono:    8-bit 44.1kHz                   joystick, MIDI  channels
                       16-bit 44.1kHz

Sound Blaster  mono:    8-bit 22.1kHz    Adlib          audio in/out,   11 synth.
               FM synth with                            joystick
               2 operators

Sound Blaster  stereo:  8-bit 22.05kHz   Adlib          audio in/out,   22
Pro Basic      mono:    8-bit 44.1kHz    Sound Blaster  joystick

Sound Blaster  stereo:  8-bit 22.05kHz   Adlib          audio in/out,   11
Pro            mono:    8-bit 44.1kHz    Sound Blaster  joystick, MIDI,
                                                        SCSI

Sound Blaster  stereo:  8-bit 4-44.1kHz  Sound Blaster  audio in/out,   20
16 ASP         stereo: 16-bit 4-44.1kHz                 joystick, MIDI

Audio Port     mono:    8-bit 22.05kHz   Adlib          audio in/out,   11
                                         Sound Blaster  joystick

Pro Audio      stereo:  8-bit 44.1kHz    Adlib          audio in/out,   20
Spectrum +                               Pro Audio      joystick
                                         Spectrum

Pro Audio      stereo: 16-bit 44.1kHz    Adlib          audio in/out,   20
Spectrum 16                              Pro Audio      joystick, MIDI,
                                         Spectrum       SCSI
                                         Sound Blaster

Thunder Board  stereo:  8-bit 22kHz      Adlib          audio in/out,   11
                                         Sound Blaster  joystick

Gravis         stereo:  8-bit 44.1kHz    Adlib,         audio line      32 sampled
Ultrasound     mono:    8-bit 44.1kHz    Sound Blaster  in/out,         32 synth.
               (w/16-bit daughtercard)                  amplified out,
               stereo: 16-bit 44.1kHz                   mic in, CD
               mono:   16-bit 44.1kHz                   audio in,
                                                        daughterboard
                                                        ports (for SCSI
                                                        and 16-bit)

MultiSound     stereo: 16-bit 44.1kHz    Nothing        audio in/out,   32 sampled
               64x oversampling                         joystick, MIDI
==================================================================================

_________________________________________________________________

Q2.7: HOW DO I CONVERT TO/FROM MU-LAW FORMAT?

Mu-law coding is a form of compression for audio signals including speech. It is widely used in the telecommunications field because it improves the signal-to-noise ratio without increasing the amount of data. Typically, mu-law compressed speech is carried in 8-bit samples. It is a companding technique. That means that it carries more information about the smaller signals than about larger signals.

On SUN Sparc systems have a look in the directory /usr/demo/SOUND. Included are table lookup macros for ulaw conversions. [Note however that not all systems will have /usr/demo/SOUND installed as it is optional - see your system admin if it is missing.]

OR, here is some sample conversion code in C.

/**
 ** Signal conversion routines for use with Sun4/60 audio chip
 **/

#include <stdio.h>

unsigned char linear2ulaw(int sample);
int ulaw2linear(unsigned char ulawbyte);

/*
** This routine converts from linear to ulaw
**
** Craig Reese: IDA/Supercomputing Research Center
** Joe Campbell: Department of Defense
** 29 September 1989
**
** References:
** 1) CCITT Recommendation G.711  (very difficult to follow)
** 2) "A New Digital Technique for Implementation of Any
**     Continuous PCM Companding Law," Villeret, Michel,
**     et al. 1973 IEEE Int. Conf. on Communications, Vol 1,
**     1973, pg. 11.12-11.17
** 3) MIL-STD-188-113,"Interoperability and Performance Standards
**     for Analog-to_Digital Conversion Techniques,"
**     17 February 1987
**
** Input: Signed 16 bit linear sample
** Output: 8 bit ulaw sample
*/

#define ZEROTRAP    /* turn on the trap as per the MIL-STD */
#define BIAS 0x84   /* define the add-in bias for 16 bit samples */
#define CLIP 32635

unsigned char linear2ulaw(int sample)
{
    /* Maps the top 8 magnitude bits to the 3-bit segment (exponent). */
    static int exp_lut[256] = {0,0,1,1,2,2,2,2,3,3,3,3,3,3,3,3,
                               4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
                               5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
                               5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
                               6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
                               6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
                               6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
                               6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
                               7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
                               7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
                               7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
                               7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
                               7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
                               7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
                               7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
                               7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7};
    int sign, exponent, mantissa;
    unsigned char ulawbyte;

    /* Get the sample into sign-magnitude. */
    sign = (sample >> 8) & 0x80;          /* set aside the sign */
    if (sign != 0) sample = -sample;      /* get magnitude */
    if (sample > CLIP) sample = CLIP;     /* clip the magnitude */

    /* Convert from 16 bit linear to ulaw. */
    sample = sample + BIAS;
    exponent = exp_lut[(sample >> 7) & 0xFF];
    mantissa = (sample >> (exponent + 3)) & 0x0F;
    ulawbyte = ~(sign | (exponent << 4) | mantissa);
#ifdef ZEROTRAP
    if (ulawbyte == 0) ulawbyte = 0x02;   /* optional CCITT trap */
#endif

    return(ulawbyte);
}

/*
** This routine converts from ulaw to 16 bit linear.
**
** Craig Reese: IDA/Supercomputing Research Center
** 29 September 1989
**
** References:
** 1) CCITT Recommendation G.711  (very difficult to follow)
** 2) MIL-STD-188-113,"Interoperability and Performance Standards
**     for Analog-to_Digital Conversion Techniques,"
**     17 February 1987
**
** Input: 8 bit ulaw sample
** Output: signed 16 bit linear sample
*/

int ulaw2linear(unsigned char ulawbyte)
{
    /* Biased start value of each of the 8 segments. */
    static int exp_lut[8] = {0,132,396,924,1980,4092,8316,16764};
    int sign, exponent, mantissa, sample;

    ulawbyte = ~ulawbyte;                 /* ulaw bytes are stored inverted */
    sign = (ulawbyte & 0x80);
    exponent = (ulawbyte >> 4) & 0x07;
    mantissa = ulawbyte & 0x0F;
    sample = exp_lut[exponent] + (mantissa << (exponent + 3));
    if (sign != 0) sample = -sample;

    return(sample);
}

_________________________________________________________________

===========================================================================

FAQ SECTION 3 - Speech Coding and Compression

Q3.1: SPEECH COMPRESSION TECHNIQUES.

Can anyone provide a 1-2 page summary on speech compression?

Note: the FAQ for comp.compression includes a few questions and answers on the compression of speech.

_________________________________________________________________

Q3.2: WHAT ARE SOME GOOD REFERENCES/BOOKS ON CODING/COMPRESSION?

* Douglas O'Shaughnessy -- Speech Communication: Human and Machine Addison Wesley series in Electrical Engineering: Digital Signal Processing, 1987.
* Bishnu Atal in ed. Fallside, F. and W. Woods, ed. Computer Speech Processing. London: Prentice/Hall International, 1985.
* Makhoul, J. "Linear Prediction: A Tutorial Review." Proc. of the IEEE 63 (1975): 561 - 580.

_________________________________________________________________

Q3.3: WHAT SPEECH COMPRESSION/CODING SOFTWARE IS AVAILABLE?

Note: there are two types of speech compression technique referred to below. Lossless techniques preserve the speech through a compression-decompression phase. Lossy techniques do not preserve the speech perfectly.
As a general rule, the more you compress speech, the more the quality degrades. File format conversion * Platform: SUN OS? * Description: Conversion utility able to encode and decode between the following formats: G.723, G.721, A-law, u-law and linear. * Availability: By anonymous ftp from + ftp://ftp.cwi.nl/pub/audio/ccitt-adpcm.tar.Z shorten - a lossless compressor for speech signals * Platform: UNIX/DOS * Description: A fast waveform coder suitable for speech and music signals in a wide variety of file formats. The degree of compression is adjustable from lossless to three bits a sample. 16bit 16kHz speech generally attains 50% lossless compression and 16:3 compression of CDROM quality speech is obtainable with only minor audible degradation. * Availability: Anonymous ftp - UNIX and DOS versions are in + ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/sources/shorten-1.14.tar.Z + ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/sources/shn114.zip 32 kbps ADPCM * Platform: SGI and Sun Sparcs * Description: 32 kbps ADPCM C-source code (G.721 compatibility is uncertain) * Contact: Jack Jansen * Availability: Anonymous ftp + ftp://ftp.cwi.nl/pub/adpcm.shar GSM 06.10 Compression * Platform: Unix; faster than real time on most Sun SPARCstations * Description: GSM 06.10 is a standardized lossy speech compression employed by most European wireless telephones. It uses RPE/LTP (residual pulse excitation/long term prediction) coding to compress frames of 160 13-bit samples (8 kHz sampling rate, i.e. a frame rate of 50 Hz) into 260 bits. * Contact: GSM 06.10 support and implementation jutta@cs.tu-berlin.de, cabo@cs.tu-berlin.de * Availability: The following configurations are available by anonymous ftp: + gzip compression from Germany: ftp://ftp.cs.tu-berlin.de/pub/local/kbs/tubmik/gsm/gsm-1.0.5.tar.gz
+ MS-DOS compression from Germany: ftp://ftp.cs.tu-berlin.de/pub/local/kbs/tubmik/gsm/gsm-105.zip + MS-DOS compression from USA: ftp://ftp.mv.com/pub/ddj/1194.12/gsm-105.zip * Misc: The WWW site is + http://www.cs.tu-berlin.de/~jutta/toast.html G.711/721/723 Compression * Description: + G.711 : CCITT u-law and A-law compression + G.721 : CCITT 32 kbps ADPCM coder + G.723 : CCITT 24 kbps and 40 kbps ADPCM coders * Availability: By email to teledoc@itu.arcom.ch, with GET ITU-3022 as the *only* line in the body of the message. This is also available by anonymous ftp from: + ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/sources/G711_G721_G723.tar.Z G.728 Compression * Description: G.728 low-delay CELP package written by Alex Zatsman of Analog Devices, Inc. * Availability: By anonymous ftp from + ftp://dspsun.eas.asu.edu/pub/speech/ldcelp.tgz G.728 LD-CELP vocoder * Platform: Analog Devices ADSP-2171 * Description: Real-time, full-duplex G.728 LD-CELP vocoder that runs on a single Analog Devices ADSP-2171. Source and object code available for a one-time license fee. * Contact: Cole Erskine Analogical Systems 299 California Avenue, Suite 120 Palo Alto, CA 94306, USA Tel:(415) 323-3232 FAX:(415) 323-4222 Internet: cole@analogical.com U.S.F.S. 1016 CELP vocoder for DSP56001 * Platform: DSP56001 * Description: Real-time U.S.F.S. 1016 CELP vocoder that runs on a single 27MHz Motorola DSP56001. Free demo software available for PC-56 and PC-56D. Source and object code available for a one-time license fee. * Contact: Cole Erskine Analogical Systems 299 California Avenue, Suite 120 Palo Alto, CA 94306, USA Tel:(415) 323-3232 FAX:(415) 323-4222 Email: cole@analogical.com 8 Kbit/s CELP on the TMS320C5x family of DSP chips * Description: For low bandwidth transmission of voice, compact voice storage for archival purposes, low-cost digital answering machines and efficient storage for voice mail. Features : + near toll quality at 8 Kb/s.
+ Variable rate option with 1 Kb/s silence encoding. + Implemented on a fixed-point processor for lower system cost. + Attractive licensing scheme. + Future availability of 4 Kb/s. + Custom rates possible. Capacity : + Two half-duplex or one full duplex channels on the 20 MIPS 'C5x (at 95% and 55% CPU utilization respectively). + Two full duplex channels on the 28.6 MIPS 'C5x (at 77% CPU utilization). + Requires 9 K-words program memory and 3 K-words data memory. + Decoding in real-time on a 486 class CPU. * Contact: CVI Inc. 443 Vienna Cres. North Vancouver, BC, Canada V7N 3B3 Tel: (604) 987 1719 Fax: (604) 986 8139 Email: cvi@extropia.wimsey.com CELP 3.2a & LPC * Platform: Sun (the makefiles & source can be modified for other platforms) * Description: CELP is a lossy compression technique. This package contains Fortran and C simulation source code for the U.S. DoD's Federal-Standard-1016 based 4800 bps code excited linear prediction voice coder, version 3.2a (CELP 3.2a). Available for worldwide distribution (on DOS diskettes, but configured to compile on Sun SPARC stations) from NTIS and DTIC. Example input and processed speech files are included. A Technical Information Bulletin (TIB), "Details to Assist in Implementation of Federal Standard 1016 CELP," and the official standard, "Federal Standard 1016, Telecommunications: Analog to Digital Conversion of Radio Voice by 4,800 bit/second Code Excited Linear Prediction (CELP)," are also available. * Availability 1: Through the National Technical Information Service: NTIS U.S. Department of Commerce 5285 Port Royal Road, Springfield, VA 22161, USA The "AD" ordering number for the CELP software is AD M000 118 (US$ 90.00) and for the TIB it's AD A256 629 (US$ 17.50). The LPC-10 standard, described below, is FIPS Pub 137 (US$ 12.50). There is a $3.00 shipping charge on all U.S. orders. The telephone number for their automated system is 703-487-4650, or 703-487-4600 if you'd prefer to talk with a real person. (U.S.
DoD personnel and contractors can receive the package from the Defense Technical Information Center: DTIC, Building 5, Cameron Station, Alexandria, VA 22304-6145. Their telephone number is 703-274-7633.)
  * Availability 2: By anonymous ftp from:
    + ftp://ftp.super.org(192.31.192.1)/pub/celp_3.2a.tar.Z
    + OR ftp://svr-ftp.eng.cam.ac.uk/comp.speech/sources/celp_3.2a.tar.Z
  * Misc: The following articles describe the Federal-Standard-1016 4.8-kbps CELP coder (it's unnecessary to read more than one):
    + Campbell, Joseph P. Jr., Thomas E. Tremain and Vanoy C. Welch, "The Federal Standard 1016 4800 bps CELP Voice Coder," Digital Signal Processing, Academic Press, 1991, Vol. 1, No. 3, p. 145-155.
    + Campbell, Joseph P. Jr., Thomas E. Tremain and Vanoy C. Welch, "The DoD 4.8 kbps Standard (Proposed Federal Standard 1016)," in Advances in Speech Coding, ed. Atal, Cuperman and Gersho, Kluwer Academic Publishers, 1991, Chapter 12, p. 121-133.
    + Campbell, Joseph P. Jr., Thomas E. Tremain and Vanoy C. Welch, "The Proposed Federal Standard 1016 4800 bps Voice Coder: CELP," Speech Technology Magazine, April/May 1990, p. 58-64.
    The U.S. DoD's Federal-Standard-1015/NATO-STANAG-4198 based 2400 bps linear prediction coder (LPC-10) was republished as a Federal Information Processing Standards Publication 137 (FIPS Pub 137). It is described in:
    + Thomas E. Tremain, "The Government Standard Linear Predictive Coding Algorithm: LPC-10," Speech Technology Magazine, April 1982, p. 40-49.
    There is also a section about FS-1015 in the book:
    + Panos E. Papamichalis, Practical Approaches to Speech Coding, Prentice-Hall, 1987.
    The voicing classifier used in the enhanced LPC-10 (LPC-10e) is described in:
    + Campbell, Joseph P., Jr. and T. E. Tremain, "Voiced/Unvoiced Classification of Speech with Applications to the U.S. Government LPC-10E Algorithm," Proceedings of the IEEE International Conf. on Acoustics, Speech, and Signal Processing, 1986, p. 473-6.
Copies of the official standard, "Federal Standard 1016, Tele- communications: Analog to Digital Conversion of Radio Voice by 4,800 bit/second Code Excited Linear Prediction (CELP)" are available for US$ 5.00 each from: GSA Federal Supply Service Bureau Specification Section, Suite 8100 470 E. L'Enfant Place, S.W. Washington, DC 20407 (202)755-0325 Realtime DSP code for FS-1015 and FS-1016 is sold by: John DellaMorte, DSP Software Engineering 165 Middlesex Tpk, Suite 206, Bedford, MA 01730, USA Ph: 1-617-275-3733 Fax: 1-617-275-4323 dspse.bedford@channel1.com DSP Software Engineering's FS-1016 code can run on a DSP Research's Tiger 30 (a PC board with a TMS320C3x and analog interface suited to development work). DSP Research 1095 E. Duane Ave, Sunnyvale, CA 94086, USA Ph: (408)773-1042 Fax: (408)736-3451 _________________________________________________________________ =========================================================================== FAQ SECTION 4 - Natural Language Processing There is now a newsgroup specifically for Natural Language Processing. It is called comp.ai.nat-lang. There is also a lot of useful information on Natural Language Processing in the FAQ for comp.ai. That FAQ lists available software and useful references. It includes a substantial list of software, documentation and other info available by ftp. _________________________________________________________________ Q4.1: WHAT ARE SOME GOOD REFERENCES/BOOKS ON NLP? Take a look at the FAQ for the "comp.ai" newsgroup as it also includes some useful references. * James Allen: Natural Language Understanding, (Benjamin/Cummings Series in Computer Science) Menlo Park: Benjamin/Cummings Publishing Company, 1987. + This book consists of four parts: syntactic processing, semantic interpretation, context and world knowledge, and response generation. * G. Gazdar and C. Mellish, Natural Language Processing in Prolog, Addison Wesley, 1989 * G. Gazdar and C. 
Mellish, Natural Language Processing in Lisp, Addison Wesley, 1989 * G. Gazdar and C. Mellish, Natural Language Processing in Pop11, Addison Wesley, 1989 + Emphasis on parsing, especially unification-based parsing, lots of details on the lexicon, feature propagation, etc. Fair coverage of semantic interpretation, inference in natural language processing, and pragmatics; much less extensive than in Allen's book, but more formal. There are three versions, one for each programming language listed above, with complete code. * Shapiro, Stuart C.: Encyclopedia of Artificial Intelligence Vol.1 and 2. New York: John Wiley & Sons, 1990. + There are articles on the different areas of natural language processing which also give additional references. * Paris, Ce'cile L.; Swartout, William R.; Mann, William C.: Natural Language Generation in Artificial Intelligence and Computational Linguistics. Boston: Kluwer Academic Publishers, 1991. + The book describes the most current research developments in natural language generation and all aspects of the generation process are discussed. The book is comprised of three sections: one on text planning, one on lexical choice, and one on grammar. * Readings in Natural Language Processing, ed by B. Grosz, K. Sparck Jones and B. Webber, Morgan Kaufmann, 1986 + A collection of classic papers on Natural Language Processing. Fairly complete at the time the book came out (1986) but now seriously out of date. Still useful for ATN's, etc. * Klaus K. Obermeier, Natural Language Processing Technologies in Artificial Intelligence: The Science and Industry Perspective, Ellis Horwood Ltd, John Wiley & Sons, Chichester, England, 1989. Journals The major journals of the field are * Computational Linguistics and Cognitive Science for the artificial intelligence aspects, * Cognition for the psychological aspects, * Language and Linguistics and Philosophy and Linguistic Inquiry for the linguistic aspects. 
  * Artificial Intelligence occasionally has papers on natural language processing.

Conferences

The major conferences of the field are
  * ACL (held every year)
  * and COLING (held every two years).

Most AI conferences have an NLP track; AAAI, ECAI, IJCAI and the Cognitive Science Society conferences usually are the most interesting for NLP. CUNY is an important psycholinguistic conference. There are lots of linguistic conferences: the most important seem to be NELS, the conference of the Chicago Linguistic Society (CLS), WCCFL, LSA, the Amsterdam Colloquium, and SALT.
_________________________________________________________________

Q4.2: WHAT NLP SOFTWARE IS AVAILABLE?

Check the comments at the start of this section for information on other newsgroups and sources of information on NLP.

Natural Language Software Registry (NLSR) - NLP Tools
  * The Natural Language Software Registry is available from the German Research Institute for Artificial Intelligence (DFKI) in Saarbrucken. Its purpose is to facilitate the exchange and evaluation of natural language processing software within the research community. To this end, the NLSR is cataloging natural language software projects, both commercial and non-commercial. The new updated and enlarged version contains more than 100 descriptions of natural language processing software.
    Registry listings include:
    + speech signal processors, such as the Computerized Speech Lab (Kay Elemetrics)
    + morphological analyzers, such as PC-KIMMO (Summer Institute for Linguistics)
    + parsers, such as Alveytools (University of Edinburgh)
    + semantic and pragmatic analyzers, such as NLL (University of the Saarland, Germany)
    + generation programs, such as FUF (Ben Gurion University of the Negev)
    + knowledge representation systems, such as Rhet (University of Rochester)
    + multicomponent systems, such as ELU (ISSCO), PENMAN (ISI), Pundit (UNISYS), SNePS (SUNY Buffalo)
    + NLP-Tools, such as GULP (University of Georgia) or Linguist (Kansai Research Laboratory)
    + applications programs (misc.)
  * If you have developed a piece of software for natural language processing that other researchers might find useful, you can include it by returning the questionnaire available from the sources below.
  * ftp://ftp.dfki.uni-sb.de/pub/registry
  * e-mail: registry@dfki.uni-sb.de
  * post: Natural Language Software Registry
    Deutsches Forschungsinstitut fuer Kuenstliche Intelligenz (DFKI)
    Stuhlsatzenhausweg 3
    D-66123 Saarbruecken
    Germany
  * Other ftp sites are
    + ftp://crlftp.nmsu.edu/pub/non-lexical/NL_Software_Registy
    + ftp://dri.cornell.edu/pub/Natural_Language_Software_Registry

Part of Speech Tagger
  * Description: A rule-based part of speech tagger developed by Eric Brill. For a detailed description of the tagger see chapter 6 of his thesis.
  * Availability: The tagger and description are available by anonymous ftp from
    + ftp://lightning.lcs.mit.edu/pub/BRILL/Programs & Papers
_________________________________________________________________

Andrew Hunt --- Speech Technology Research Group Ph: 61-2-351 4509 Dept.
of Electrical Engineering Fax: 61-2-351 3847 University of Sydney, NSW, 2006, Australia email: andrewh@speech.su.oz.au

Archive-name: comp-speech-faq/part3
Last-modified: 1995/01/19

COMP.SPEECH FAQ POSTING - PART 3/3

===========================================================================

FAQ SECTION 5 - Speech Synthesis

Q5.1: WHAT IS SPEECH SYNTHESIS?

Speech synthesis is the task of transforming written input to spoken output. The input can either be provided in a graphemic/orthographic or a phonemic script, depending on its source.
_________________________________________________________________

Q5.2: HOW CAN SPEECH SYNTHESIS BE PERFORMED?

There are several algorithms. The choice depends on the task they're used for. The easiest way is to just record the voice of a person speaking the desired phrases. This is useful if only a restricted set of phrases and sentences is used, e.g. messages in a train station, or schedule information via phone. The quality depends on the way recording is done.

More sophisticated but worse in quality are algorithms which split the speech into smaller pieces. The smaller those units are, the fewer of them are needed, but the quality also decreases. An often used unit is the phoneme, the smallest linguistic unit. Depending on the language used there are about 35-50 phonemes in western European languages, i.e. 35-50 single recordings are needed. The problem lies in combining them, as fluent speech requires fluent transitions between the elements. The intelligibility is therefore lower, but the memory required is small.

A solution to this dilemma is using diphones. Instead of splitting at the transitions, the cut is done at the center of the phonemes, leaving the transitions themselves intact. This gives about 400 elements (20*20) and the quality increases.
The longer the units become, the more elements there are, but the quality increases along with the memory required. Other units which are widely used are half-syllables, syllables, words, or combinations of them, e.g. word stems and inflectional endings.
_________________________________________________________________

Q5.3: WHAT ARE SOME GOOD REFERENCES/BOOKS ON SYNTHESIS?

The following are good introductory books/articles.
  * Douglas O'Shaughnessy -- Speech Communication: Human and Machine, Addison Wesley series in Electrical Engineering: Digital Signal Processing, 1987.
  * D. H. Klatt, "Review of Text-To-Speech Conversion for English", Jnl. of the Acoustical Society of America (JASA), v82, Sept. 1987, pp 737-793.
  * "Talking Machines, Theories, Models and Designs", Eds. G. Bailly & C. Benoit (Elsevier: North Holland)
  * I. H. Witten. Principles of Computer Speech. (London: Academic Press, Inc., 1982).
  * John Allen, Sharon Hunnicutt and Dennis H. Klatt, "From Text to Speech: The MITalk System", Cambridge University Press, 1987.
_________________________________________________________________

Q5.4: WHAT SPEECH SYNTHESIS SOFTWARE/HARDWARE IS AVAILABLE?

Please email any updates, corrections or additions to the following list. The range of commercially available synthesis software is growing rapidly so any help in keeping up to date will be appreciated.

Orator Text-to-Speech Synthesizer
  * Platform: SUN SPARC, Decstation 5000. Written in C, and therefore portable to other UNIX platforms. Some successful ports: HP, RS-6000, PC-Unix [Linux].
  * Description: Sophisticated speech synthesis package. Has text preprocessing (for abbreviations, numbers), acronym rules, and human-like spelling routines. Natural-sounding synthesis based on demisyllable concatenation.
Has high accuracy for pronunciation of names of people, places and businesses in America; good accuracy for English text; rules for stress and intonation marking; various methods of user control and customization at most stages of processing. A new version of the ORATOR system is under development. Both ORATOR and this new "ORATOR II" system are capable of very good general text synthesis. The ORATOR II system has a more natural-sounding voice.
  * Hardware: Runs on common SPARC or Decstation workstations, using their internal audio output capability. Recommend at least 16M of memory.
  * Availability and Pricing: Contact Bellcore's Licensing Office (1-800-527-1080) or email Anthony Lindsey alin1@panix.com

Text to phoneme program (1)
  * Platform: unknown
  * Description: Text to phoneme program. Based on Naval Research Lab's set of text to phoneme rules.
  * Availability: by anonymous ftp
    + ftp://shark.cse.fau.edu/pub/src/phon.tar.Z

Text to phoneme program (2)
  * Platform: unknown
  * Description: Text to phoneme program.
  * Availability: by anonymous ftp
    + ftp://wuarchive.wustl.edu/mirrors/unix-c/utils/phoneme.c

Text to phoneme program (3)
  * Description: A public domain version of the same Naval Research Lab text to phoneme rules.
  * Availability: By anonymous ftp
    + ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/sources/english2phoneme.shar

Text to speech program
  * Description: An implementation of the Klatt phoneme to waveform speech synthesiser.
  * Availability: By anonymous ftp
    + ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/sources/klatt-0.02.tar.Z

"Speak" - a Text to Speech Program
  * Platform: Sun SPARC
  * Description: Text to speech program based on concatenation of pre-recorded speech segments. A function library can be used to integrate speech output into other code.
* Hardware: SPARC audio I/O * Availability: by anonymous ftp + ftp://wilma.cs.brown.edu/pub/speak.tar.Z TheBigMouth - a Text to Speech Program * Platform: NeXT * Description: Text to speech program based on concatenation of pre-recorded speech segments. NeXT equivalent of "Speak" for Suns. * Availability: try NeXT archive sites such as sonata.cc.purdue.edu. TextToSpeech Kit * Platform: NeXT Computers * Description: The TextToSpeech Kit does unrestricted conversion of English text to synthesized speech in real-time. The user has control over speaking rate, median pitch, stereo balance, volume, and intonation type. Text of any length can be spoken, and messages can be queued up, from multiple applications if desired. Real-time controls such as pause, continue, and erase are included. Pronunciations are derived primarily by dictionary look-up. The Main Dictionary has nearly 100,000 hand-edited pronunciations which can be supplemented or overridden with the User and Application dictionaries. A number parser handles numbers in any form. A letter-to-sound knowledge base provides pronunciations for words not in the Main or customized dictionaries. Dictionary search order is under user control. Special modes of text input are available for spelling and emphasis of words or phrases. The actual conversion of text to speech is done by the TextToSpeech Server. The Server runs as an independent task in the background, and can handle up to 50 client connections. * Misc: The TextToSpeech Kit comes in two packages: the Developer Kit and the User Kit. The Developer Kit enables developers to build and test applications which incorporate text-to-speech. It includes the TextToSpeech Server, the TextToSpeech Object, the pronunciation editor PrEditor, several example applications, phonetic fonts, example source code, and developer documentation. The User Kit provides support for applications which incorporate text-to-speech. It is a subset of the Developer Kit. 
  * Hardware: Uses standard NeXT Computer hardware.
  * Cost:
    + TextToSpeech User Kit: $175 CDN ($145 US)
    + TextToSpeech Developer Kit: $350 CDN ($290 US)
    + Upgrade from User to Developer Kit: $175 CDN ($145 US)
  * Availability: Trillium Sound Research
    1500, 112 - 4th Ave. S.W., Calgary, Alberta, Canada, T2P 0H3
    Tel: (403) 284-9278 Fax: (403) 282-6778
    Order Desk: 1-800-L-ORATOR (US and Canada only)
    Email: TTSInfo@trillium.ab.ca

SGI Developers Toolbox Synthesiser
  * Platform: SGI
  * Description: The SGI Developer Toolbox 4.0 CDROM contains a basic public domain text-to-speech program in the publics/speak directory. The directory includes man pages and source.
  * Availability: on the SGI Developer Toolbox 4.0 CDROM

rsynth
  * Platform: Various (including Solaris2.3, SunOS4.1.3, HPUX, SGI Irix4.x, Linux)
  * Description: Public domain text-to-speech system assembled from a variety of sources. It supports CMU and "beep" format dictionaries and now utilises stress marks in the dictionary in synthesising intonation.
  * Price: Free
  * Availability: by anonymous ftp from
    + ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/sources/rsynth-2.0.tar.Z
    + ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/sources/rsynth-2.0.tar.gz

SENSYN speech synthesizer
  * Platform: PC, Mac, Sun, and NeXT
  * Rough Cost: $300
  * Description: This formant synthesizer produces speech waveform files based on the (Klatt) KLSYN88 synthesizer. It is intended for laboratory and research use. Note that this is NOT a text-to-speech synthesizer, but creates speech sounds based upon a large number of input variables (formant frequencies, bandwidths, glottal pulse characteristics, etc.) and would be used as part of a TTS system. Includes full source code.
  * Availability: Sensimetrics Corporation
    64 Sidney Street, Cambridge MA 02139.
    Fax: (617) 225-0470; Tel: (617) 225-2442.
    Email: sensimetrics@sens.com

spchsyn.exe
  * Platform: PC?
  * Availability: By anonymous ftp as a self extracting DOS archive.
    + ftp://evans.ee.adfa.oz.au/mirrors/tibbs/applications/spchsyn.exe
  * Requirements: May require special TI product(s), but all source is there.

CSRE: Canadian Speech Research Environment
  * Platform: PC
  * Cost: Distributed on a cost recovery basis.
  * Description: CSRE is a software system which includes, in addition to the Klatt speech synthesizer, SPEECH ANALYSIS and an EXPERIMENT CONTROL SYSTEM. A paper about the whole package can be found in:
    + Jamieson D.G. et al, "CSRE: A Speech Research Environment", Proc. of the Second Intl. Conf. on Spoken Language Processing, Edmonton: University of Alberta, pp. 1127-1130.
  * Hardware: Can use a range of data acquisition/DSP hardware.
  * Availability: For more information contact Krystyna Marciniak, email march@uwovax.uwo.ca, Tel (519) 661-3901, Fax (519) 661-3805. For technical information email ramji@uwovax.uwo.ca
  * Note: A more detailed description is given in Section 1.9 on speech environments.

Eloquence (currently an alpha release)
  * Platform: Windows and Solaris
  * Description: Software based text-to-speech package. Generates waveforms completely algorithmically instead of by concatenating waveforms, for maximum flexibility and naturalism. For instance, when the user requests a deeper voice, the software simulates a larger vocal tract, instead of simply pitch-shifting samples. Uses high-level linguistic parsing, which obviates the need for a huge dictionary. Handles numbers, acronyms, currency, etc. Includes a set of annotation symbols, for placing stress on particular words, expressing excitement/boredom, etc. Also allows phonetic input. The final version, including support for Windows DDE and OLE and UNIX Sockets, will be released by the end of 1994. Produces male and female voices for General American English. Dialects under development include Alabama, Brooklyn, and Boston.
  * Price: $5000 (unconfirmed)
  * Availability: Eloquent Technology, Inc.
    2389 North Triphammer Road
    Ithaca, NY 14850
    Ph: (607) 607-266-7025 Fax: (607) 607-266-7030
    Email: eti@plab.dmll.cornell.edu

JSRU
  * Platform: UNIX and PC
  * Cost: 100 pounds sterling (from academic institutions and industry)
  * Description: A C version of the JSRU system, Version 2.3 is available. It's written in Turbo C but runs on most Unix systems with very little modification. A Form of Agreement must be signed to say that the software is required for research and development only.
  * Contact: Dr. E.Lewis (eric.lewis@bristol.ac.uk)

Klatt-style synthesiser
  * Platform: Unix
  * Cost: Free
  * Description: Software posted to comp.speech in late 1992.
  * Availability: By anonymous ftp from the comp.speech archives
    + ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/sources/klatt-0.02.tar.Z

DECTalk
  * Description: Speech synthesis hardware and software. Detailed information on DECtalk and other DEC products is available on a World-Wide Web site.
    + http://www.digital.com/info.html
    For specific information on DECtalk, check out this www url:
    + http://www.digital.com/archive/pub/Digital/info/Customer-Update/940620005.txt

Speech Manager and PlainTalk
  * Platform: Macintosh
  * Cost: Free
  * Description: Apple's new text-to-speech system extension(s) that enable applications (listed below) to perform text-to-speech conversion. The Speech Manager runs on most Macs, but PlainTalk (and the high quality voices) requires a 68020 Mac or better.
  * Availability: By anonymous ftp from:
    + ftp://ftp.apple.com/dts/mac/sys.soft/speech
    There are 3 files in this directory:
    6273632 Aug 14 22:51 macintalk-pro.hqx
      PlainTalk Text-To-Speech 1.0 speech synthesizer extension (includes Female Voice, Compressed); TTS Female Voice; TTS Male Voice; and TTS Male Voice, Compressed. Requires 68020 or better!
    370108 Aug 13 04:30 speech-manager-docs.hqx
      Apple DocViewer format (Inside Macintosh style, no installation instructions - just drag everything onto your closed System Folder).
    262569 Aug 7 07:01 speech-manager.hqx
      Speech Manager 1.1.1 (includes Marvin's voice) and MacInTalk Voices 1.1.1 (9 more voices). Runs on most Macs.

Various Mac Speech Output Applications
  * Platform: Macintosh
  * Cost: Free (except for At Ease)
  * Description: Some of the Speech Manager aware text-to-speech (TTS) applications, etc. are listed below (there are more on the Apple Developer CD-ROMs).

    Application, etc.   Source              Comments
    _________________   __________________  _________________________________________________
    AddressSpeech       info-mac            4D talking address book (from Speech Pack 2.0)
    At Ease 2.0         MacWarehouse        Friendly desktop that speaks file names
    At Ease 2.0 WG      MacWarehouse        Friendly desktop that speaks file names
    Eliza 3.1           AOL                 Talking Eliza (Rogerian psych therapist)
    FB speech           Inside Basic Mag,   volume 3, no. 6. FutureBasic demo
    FB Speech demo      Inside Basic Mag,   volume 3, no. 7. FutureBasic demo
    Fortune 1.1         info-mac            Like a talking UNIX fortune command - slick
    Homer 0.92d9        zaphod.ee.pitt.edu  GUI IRC client, assign nicks voices - slick
    MacMessage 1.0      FirstClassBBS       Share talking messages/customizable startup
    Say                 info-mac            MPW Tool which converts standard input to speech
    ScriptTools 1.2     info-mac            Write AppleScript scripts to say text messages
    Siege Watch 1.01f   info-mac            Wryly political speaking clock
    SoToSpeak1.0.0b10   info-mac            Two voice conversation (also see Fortune's About)
    Speak It!           info-mac            Type in a message and have it spoken
    Speaker 1.11        info-mac            Simple text file editor, speaks on CR, macros
    Speecher 1.2.1      info-mac            Customizable word pronunciation/substitution
    SpeechManagerdemo   info-mac            Command line interface, C source, aka -explorer
    Speech Pack 2.0     info-mac            4th Dimension external, add speech to database
    SpeechUnitEx        info-mac            Pascal source code for speech in Lab 7
    speek-02b           info-mac            Speech XCMD for HyperCard
    TalkingClockPro2.0  info-mac            AppleScriptable talking clock extension (2.0b0)
    TeachText 7.2       AV Mac              Apple's talking TeachText (simple editor w/QT)
    Tex-Edit 1.9        AOL                 Talking word processor, McSink like, modeming
    VoiceDemo 1.0.1     info-mac            Bare bones phrase talker
    Welcome!v1.3.1      info-mac            A talking Welcome to Macintosh startup
    ?                   ?                   Talking Plug-In-Module for MS Word 5, experimental, unsupported, buggy, beware!
    Speech Rhythms      AOL                 A cool text file for one of the above apps

  * Sources:
    + AOL = America Online
    + info-mac = {ftp sumex-aim.stanford.edu, ftp wuarchive.wustl.edu, et al.}
    + MacWarehouse = (800) 255-6227
  * Misc: Apple's work in spoken language technologies and systems is described in:
    + Lee, Kai-Fu. "The Conversational Computer: An Apple Perspective." (Keynote Speech) In Proc. Eurospeech in Berlin, September, 1993.

MacinTalk
  * Platform: Macintosh
  * Cost: Free
  * Description: Formant based speech synthesis. There is also a program called "tex-edit" which apparently can pronounce English sentences reasonably using Macintalk.
  * Note: MacinTalk doesn't run reliably on Macintoshes with new sound hardware under the latest OS (System 7.1 w/HUD 2.0). More recent software is listed above.
  * Availability: By anonymous ftp from many archive sites (have a look on archie if you can). tex-edit is on many of the same sites.
    Try
    + ftp://wuarchive.wustl.edu/mirrors2/info-mac/Old/card/macintalk.hqx
    + ftp://wuarchive.wustl.edu/mirrors2/info-mac/Old/card/macintalk-stack.hqx
    + ftp://wuarchive.wustl.edu/mirrors2/info-mac/app/tex-edit-15.hqx

Monologue by Creative Labs
  * Platform: PC Windows plus SoundBlaster 16
  * Cost: $99.00 or free with some MultiMedia packages
  * Description: Phoneme based speech synthesis software which provides output on Sound Blaster compatible audio cards. It includes a dictionary of words that are "exceptions" together with a dictionary manager for modifying those words. It can be used as a stand alone program with Windows' Clipboard or as a DDE server dynamically linked (DLL) to a program you write.
  * Contact: Creative Labs Inc.
    1901 McCarthy Boul, Milpitas, CA 95035, USA
    Tel: 408-428-6622 Fax: 408-428-6633 BBS: 408-428-6660
    OR
    Creative Technology Ltd.
    67 Ayer Rajah Crescent #03-18, Singapore 0513
    Tel: 65-870-0433 Fax: 65-773-0353 BBS: 65-776-2423

Lernout & Hauspie Text-To-Speech SDK
  * Platform: IBM-Compatible
  * Description: The L&H Text-to-Speech software developers kit lets you integrate text-to-speech technology with your own or existing PC applications under Microsoft Windows 3.1. This software will allow conversion of written text into clear human sounding synthetic speech.
  * Requirements: IBM-compatible PC 386 DX (33MHz) or higher, 8Mb RAM, MS DOS 5.0 (or higher), MS Windows 3.1 (or higher), Compiler and linker: Microsoft(R) Visual C++ or Borland C++, Windows(TM) 3.1 compatible sound card, preferably 16 bit, e.g. Soundblaster, Windows Sounds System, Pro Audio Spectrum
  * Price: Unconfirmed $1,999 per copy, and $499 per each additional language (American English, French, German, or Spanish).
  * Contact: USA (617) 932-4118

Tinytalk
  * Platform: PC
  * Description: Shareware speech 'screen reader' package which is used by many blind users.
  * Availability: By anonymous ftp
    + ftp://handicap.shel.isc-br.com/speech
    Get the files ttexe166.zip and ttdoc166.zip.

Narrator - narrator.device
  * Platform: Amiga
  * Description: Formant based speech synthesis. Includes an English-to-phoneme translation library, and a SPEAK: pseudo-device for speech output.
  * Hardware: Standard Amiga hardware
  * Availability: Part of AmigaOS

Infovox Product Range
  * Description: Multilingual Text-to-speech systems, languages available: American English, British English, German, French, Spanish, Italian, Swedish, Norwegian, Icelandic, Danish and Finnish.
  * Product name: INFOVOX 500, PC BOARD
    + Product description: Half length expansion board for IBM PC, XT, AT, PS/2 model 30 or compatible personal computers. The board can also be connected via the serial port. Language and control program for downloading into RAM or mounted on EPROMs.
    + Platform: for IBM PC, XT, AT, PS/2 model 30 or compatible
  * Product name: INFOVOX 600, OEM BOARD
    + Product description: OEM board built with CMOS IC's. Language and control program are stored in on-board fixed memory.
    + Platform: any, Interface: 9-pole D-SUB (RS 232-C) 300-9600 Baud
  * Product name: INFOVOX 700, DESKTOP UNIT
    + Product description: Desktop unit with built in Infovox 600 to be connected to any computer or terminal via an RS 232-C serial interface. Built in loudspeaker and rechargeable battery for 4 hours use, and control knobs for continuous control of speech volume and speed.
    + Platform: any
  * Product name: INFOVOX 650, OEM BOARD
    + Product description: OEM board built with CMOS IC's. Language and control program are stored in on-board memory.
    + Platform: any, Interface: 9-pole D-SUB (RS 232-C) 300-9600 Baud
  * Product name: INFOVOX 750, DESKTOP UNIT
    + Product description: Desktop unit with built in Infovox 650 to be connected to any computer or terminal via an RS 232-C serial interface.
      Built in loudspeaker and rechargeable battery for 5 hours use, and a control knob for continuous control of speech volume.
    + Platform: any
  * Misc: Infovox multi-lingual Text-to-Speech Technologies can interface with Apple's PlainTalk System. It enables Apple Third party developers to write application software with synthetic speech output using their usual Apple Plain Talk Text-to-Speech interface. Software already written for the English speaking market using Apple Plain Talk can now be distributed worldwide, provided message strings are translated.
  * Contact: Telia Promotor Infovox AB
    TTS Sales Division
    P.O. Box 2069
    S-171 02 Solna, Sweden
    Ph: +46 8 764 35 00 Fax: +46 8 735 78 76
    email: tts-sales@infovox.se

SIMTEL-20
  * The following is a list of speech related software available from SIMTEL-20 and its mirror sites for PCs.
  * The SIMTEL internet address is WSMR-SIMTEL20.Army.Mil [192.88.110.20]. Try looking at your nearest archive site first. [Note: problems have been reported in accessing this site - does anyone know a new address?]

    Directory PD1: MSDOS.VOICE

    Filename      Type  Length  Date    Description
    ==============================================
    AUTOTALK.ARC  B      23618  881216  Digitized speech for the PC
    CVOICE.ARC    B      21335  891113  Tells time via voice response on PC
    HEARTYPE.ARC  B      10112  880422  Hear what you are typing, crude voice synth.
    HELPME2.ARC   B       8031  871130  Voice cries out 'Help Me!' from PC speaker
    SAY.ARC       B      20224  860330  Computer Speech - using phonemes
    SPEECH98.ZIP  B      41003  910628  Build speech (voice) on PC using 98 phonemes
    TALK.ARC      B       8576  861109  BASIC program to demo talking on a PC speaker
    TRAN.ARC      B      39766  890715  Repeats typed text in digital voice
    VDIGIT.ZIP    B     196284  901223  Toolkit: Add digitized voice to your programs
    VGREET.ARC    B      45281  900117  Voice says good morning/afternoon/evening
_________________________________________________________________

===========================================================================

FAQ SECTION 6 - Speech Recognition

Q6.1: WHAT IS SPEECH RECOGNITION?

Automatic speech recognition is the process by which a computer maps an acoustic speech signal to text.

Automatic speech understanding is the process by which a computer maps an acoustic speech signal to some form of abstract meaning of the speech.
_________________________________________________________________

Q6.2: HOW CAN I BUILD A VERY SIMPLE SPEECH RECOGNISER?

Doug Danforth provides a detailed account in article 253 in the comp.speech archives. A summary is provided below. It is also available by anonymous ftp
  * ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/info/DIY_SpeechRecognition

QUICKY RECOGNIZER sketch:

Here is a simple recognizer that should give you 85%+ recognition accuracy. The accuracy is a function of the words you have in your vocabulary. Long distinct words are easy. Short similar words are hard. You can get 98+% on the digits with this recognizer.

Overview:
  * Find the beginning and end of the utterance.
  * Filter the raw signal into frequency bands.
  * Cut the utterance into a fixed number of segments.
  * Average data for each band in each segment.
  * Store this pattern with its name.
  * Collect a training set of about 3 repetitions of each pattern (word).
  * Recognize an unknown by comparing its pattern against all patterns in the training set and returning the name of the pattern closest to the unknown.
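
The overview steps can be sketched in a few lines of Python. This is only an illustrative sketch and not Doug Danforth's code: the function names, the energy threshold used for end-point detection, the naive DFT used as a stand-in filter bank, the normalisation of band energies, and the synthetic sine "words" are all assumptions made for the example.

```python
import cmath
import math


def find_endpoints(samples, frame_len=64, ratio=0.1):
    """Return (start, end) sample indices of the utterance, judged by frame energy."""
    energies = [sum(x * x for x in samples[i:i + frame_len])
                for i in range(0, len(samples) - frame_len + 1, frame_len)]
    threshold = ratio * max(energies)  # assumed threshold: 10% of the loudest frame
    active = [i for i, e in enumerate(energies) if e >= threshold]
    return active[0] * frame_len, (active[-1] + 1) * frame_len


def band_shares(frame, n_bands):
    """Fraction of the frame's spectral energy in each of n_bands equal-width bands."""
    n = len(frame)
    half = n // 2
    power = []
    for k in range(half):  # naive DFT, used here in place of a real filter bank
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        power.append(abs(acc) ** 2)
    width = half // n_bands
    bands = [sum(power[b * width:(b + 1) * width]) for b in range(n_bands)]
    total = sum(bands)
    return [e / total if total else 0.0 for e in bands]


def make_pattern(samples, n_segments=4, n_bands=4, frame_len=64):
    """End-point the utterance, cut it into segments, average each band per segment."""
    start, end = find_endpoints(samples, frame_len)
    frames = [samples[i:i + frame_len] for i in range(start, end, frame_len)]
    feats = [band_shares(f, n_bands) for f in frames]
    pattern = []
    for s in range(n_segments):
        lo = s * len(feats) // n_segments
        hi = max(lo + 1, (s + 1) * len(feats) // n_segments)
        seg = feats[lo:hi]
        pattern.extend(sum(f[b] for f in seg) / len(seg) for b in range(n_bands))
    return pattern


def recognise(samples, training):
    """training is a list of (name, pattern); return the name of the closest pattern."""
    unknown = make_pattern(samples)

    def dist(pattern):
        return sum((a - b) ** 2 for a, b in zip(unknown, pattern))

    return min(training, key=lambda item: dist(item[1]))[0]


def tone(freq_hz, amplitude=1.0, duration_s=0.1, rate=8000, pad=200):
    """Synthetic stand-in for a recorded word: silence, a sine tone, silence."""
    n = int(duration_s * rate)
    body = [amplitude * math.sin(2 * math.pi * freq_hz * t / rate) for t in range(n)]
    return [0.0] * pad + body + [0.0] * pad


# Three repetitions of each "word", as the overview suggests.
training = [(name, make_pattern(tone(freq, amp)))
            for name, freq in (("low", 300), ("high", 2000))
            for amp in (0.8, 1.0, 1.2)]

print(recognise(tone(320, 0.9), training))
print(recognise(tone(2100, 1.1), training))
```

With real recordings the tone() helper would be replaced by sampled speech, and the number of bands and segments would need tuning for the vocabulary.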

Many variations upon the theme can be made to improve the performance. Try different filtering of the raw signal and different processing methods.

Q6.8 contains information on public domain speech recognition software: Lotec and Myers' Hidden Markov Model software.
_________________________________________________________________

Q6.3: WHAT DOES SPEAKER DEPENDENT/ADAPTIVE/INDEPENDENT MEAN?

A speaker dependent system is developed to operate for a single speaker. These systems are usually easier to develop, cheaper to buy and more accurate, but not as flexible as speaker adaptive or speaker independent systems.

A speaker independent system is developed to operate for any speaker of a particular type (e.g. American English). These systems are the most difficult to develop and the most expensive, and their accuracy is lower than that of speaker dependent systems. However, they are more flexible.

A speaker adaptive system is developed to adapt its operation to the characteristics of new speakers. Its difficulty lies somewhere between speaker independent and speaker dependent systems.
_________________________________________________________________

Q6.4: WHAT DOES SMALL/MEDIUM/LARGE/VERY-LARGE VOCABULARY MEAN?

The size of vocabulary of a speech recognition system affects the complexity, processing requirements and the accuracy of the system. Some applications only require a few words (e.g. numbers only), others require very large dictionaries (e.g. dictation machines).

There are no established definitions; however, as a rough guide:
* small vocabulary - tens of words
* medium vocabulary - hundreds of words
* large vocabulary - thousands of words
* very-large vocabulary - tens of thousands of words.
_________________________________________________________________

Q6.5: WHAT DOES CONTINUOUS SPEECH OR ISOLATED-WORD MEAN?

An isolated-word system operates on single words at a time - requiring a pause between saying each word.
This is the simplest form of recognition to perform because the end points are easier to find and the pronunciation of a word tends not to affect others. Thus, because the occurrences of words are more consistent, they are easier to recognise.

A continuous speech system operates on speech in which words are connected together, i.e. not separated by pauses. Continuous speech is more difficult to handle because of a variety of effects. First, it is difficult to find the start and end points of words. Another problem is "coarticulation". The production of each phoneme is affected by the production of surrounding phonemes, and similarly the start and end of words are affected by the preceding and following words. The recognition of continuous speech is also affected by the rate of speech (fast speech tends to be harder).
_________________________________________________________________

Q6.6: HOW IS SPEECH RECOGNITION PERFORMED?

A wide variety of techniques are used to perform speech recognition. There are many types of speech recognition, and many levels of speech recognition, analysis and understanding.

Typically speech recognition starts with the digital sampling of speech. The next stage is acoustic signal processing. Most techniques include spectral analysis; e.g. LPC analysis, MFCC, cochlea modelling and many, many more.

The next stage is recognition of phonemes, groups of phonemes and words. This stage can be achieved by many processes such as DTW (Dynamic Time Warping), HMM (hidden Markov modelling), NNs (Neural Networks), expert systems and combinations of techniques. HMM-based systems are currently the most commonly used and most successful approach.

Most systems utilise some knowledge of the language to aid the recognition process.

Some systems try to "understand" speech. That is, they try to convert the words into a representation of what the speaker intended to mean or achieve by what they said.
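
Of the matching techniques named above, DTW is the easiest to show in a few lines. The Python sketch below is a textbook form of the algorithm and is not taken from any package listed in this FAQ. It assumes each frame is a single number and uses absolute difference as the local cost; a real recogniser would compare spectral feature vectors frame by frame. The name dtw_distance is made up for the example.

```python
def dtw_distance(a, b):
    """Dynamic time warping cost between two sequences of scalar features.

    cost[i][j] is the cheapest alignment of the first i frames of a with the
    first j frames of b; each step may advance a, b, or both.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])          # assumed local distance
            cost[i][j] = local + min(cost[i - 1][j],      # a advances, b repeats
                                     cost[i][j - 1],      # b advances, a repeats
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]

# A template and a slower rendition of it align at zero cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

An isolated-word recogniser built this way stores one or more templates per word and returns the word whose template has the lowest warping cost to the unknown utterance.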
_________________________________________________________________

Q6.7: WHAT ARE SOME GOOD REFERENCES/BOOKS ON SPEECH RECOGNITION?

Some reviews of speech recognition for personal computers:
* "Seybold Report on Desktop Publishing" published a nine-page, head-to-head comparison of Dragon's DOS software with IBM's OS/2 software. March 7, 1994; Volume 8, Number 7; Pages 3-11; ISSN:0889-9762; Seybold Publications, P.O. Box 644, Media, PA 19063 USA, phone (610) 565-2480.
* McGraw-Hill Inc.'s "BYTE, the Magazine of Technology Integration," published a two-page review of IBM's Personal Dictation System software. May 1994; Volume ?, Number ?; Pages 145-146; ISSN:0360-5280; Editorial, Executive, and Circulation address: One Phoenix Mill Lane, Peterborough, NH 03458 USA, phone ?

Some general introduction books on speech recognition technology:
* Fundamentals of Speech Recognition; Lawrence Rabiner & Biing-Hwang Juang. Englewood Cliffs NJ: PTR Prentice Hall (Signal Processing Series), c1993. ISBN 0-13-015157-2
* Speech recognition by machine; W.A. Ainsworth. London: Peregrinus for the Institution of Electrical Engineers, c1988
* Speech synthesis and recognition; J.N. Holmes. Wokingham: Van Nostrand Reinhold, c1988
* Speech Communication: Human and Machine; Douglas O'Shaughnessy. Addison Wesley series in Electrical Engineering: Digital Signal Processing, 1987.
* Electronic speech recognition: techniques, technology and applications; edited by Geoff Bristow. London: Collins, 1986
* Readings in Speech Recognition; edited by Alex Waibel & Kai-Fu Lee. San Mateo: Morgan Kaufmann, c1990

More specific books/articles:
* Hidden Markov models for speech recognition; X.D. Huang, Y. Ariki, M.A. Jack. Edinburgh: Edinburgh University Press, c1990
* Automatic speech recognition: the development of the SPHINX system; Kai-Fu Lee. Boston; London: Kluwer Academic, c1989
* Prosody and speech recognition; Alex Waibel. (Pitman: London) (Morgan Kaufmann: San Mateo, Calif) 1988
* S. E. Levinson, L. R. Rabiner and M. M. Sondhi, "An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition" in Bell Syst. Tech. Jnl. v62(4), pp1035--1074, April 1983
* R. P. Lippmann, "Review of Neural Networks for Speech Recognition", in Neural Computation, v1(1), pp 1-38, 1989.
_________________________________________________________________

Q6.8: WHAT SPEECH RECOGNITION PACKAGES ARE AVAILABLE?

The following packages are presented in no particular order.

HM2007 - Speech Recognition Chip

* Description: HM2007 is a 48-pin single chip CMOS voice recognition LSI circuit with on-chip analog front end, voice analysis, recognition process and system control functions. A 40 word isolated-word voice recognition system can be composed of an external microphone, keyboard, SRAM and a few other components. When combined with a microprocessor, an intelligent recognition system can be built. A demo board for this chip is being distributed by The Summa Group.
* Cost: Approx US$30 for the HM2007 and US$100 for the demo board.
* Warning: Several people have reported problems in obtaining small numbers of this chip (say less than 10). It appears that the distributors (including the one listed below) are only interested in large volumes. If you know of a good source please send it in for inclusion in the FAQ.
* Contact: The Summa Group Limited
  One California Street, Suite #1940, San Francisco, CA 94111
  Ph: (415) 288-0390

Voice Blaster Ver. 4.0

* Platform: IBM AT or higher, DOS or Windows 3.1
* Description: Uses a Sound Blaster or compatible board. Contains a microphone headset and a connector for LPT1:. A printer can still be used on LPT1:. Will recognize 1024 words that are trained by the operator. Each word activates a macro that can enter an ASCII word on the screen or into a word processor, or invoke a batch file. An optional footswitch may be installed. Software to run under DOS or Windows 3.1 is included.
* Cost: Around $150 Canadian.
* Contact: COVOX Inc.
  675 Conger Street, Eugene, Oregon, 97402, USA
  Ph: (503) 342-1271
  Fax: (503) 342-1283
  BBS: (503) 342-4135

Votan

* Platform: MS-DOS, SCO UNIX
* Description: Isolated word and continuous speech modes, speaker dependent and (limited) speaker independent. Vocab size is 255 words or up to a fixed memory limit - but it is possible to dynamically load different words for an effectively unlimited number of words.
* Rough Cost: Approx US $1,000-$1,500
* Requirements: Cost includes one Votan Voice Recognition ISA-bus board for 386/486-based machines. A software development system is also available for DOS and Unix.
* Misc: Up to 8 Votan boards may co-exist for 8 simultaneous voice users. A telephone interface is also available. There is also a 4GL and a software development system. Apparently there is more than one version - more info required.
* Contact: 800-877-4756, 510-426-5600

Entropic's HTK (HMM Toolkit)

* Platform: Range of Unix platforms.
* Description: HTK is a software toolkit for building continuous density HMM based speech recognisers. It consists of a number of library modules and a number of tools. Functions include speech analysis, training tools, recognition tools, results analysis, and an interactive tool for speech labelling. Many standard forms of continuous density HMM are possible. Can perform isolated word or connected word speech recognition. It can model whole words or sub-word units. Can perform speaker verification and other pattern recognition work using HMMs. HTK is now integrated with the ESPS/Waves speech research environment, which is described in Q1.9.
* Misc: The availability of HTK changed in early 1993 when Entropic obtained exclusive marketing rights to HTK from the developers at Cambridge.
* Cost: On request.
* Contact: Entropic Research Laboratory,
  600 Pennsylvania Ave, S.E. Suite 202, Washington, D.C. 20003, USA
  Phone: (202) 547-1420.
  email: info@entropic.com

DragonDictate version 3.0

* Platform: PC
* Description: Speaker-adaptive recognition system for discrete speech. Provides 110,000 word dictionary and also allows user to add words. Active vocabulary of 5,000, 30,000, or 60,000 words. Allows dictation into almost all DOS applications (word processors, spreadsheets, etc.) and hands-free operation of the PC.
* Cost: Prices including audio board and high-quality headset microphone:
  + US$695 (5,000 word Starter Edition)
  + US$995 (30,000 word Classic Edition)
  + US$1,995 (60,000 word Power Edition)
* Requirements: Minimum of 33 MHz 486 with 8-16M memory and at least 29M disk space (depending on product), one 8-bit slot, DOS 5.0 and up (also runs in a DOS box under Windows or OS/2).
* Contact: Dragon Systems, Inc.
  320 Nevada Street, Newton, MA 02160, USA
  Tel: 1-617-965-5200, Fax: 1-617-527-0372

DragonDictate for Windows

* Platform: PC
* Description: Speech-to-text dictation system. Discrete speech; speaker-adaptive. Also provides command/control and mouse movement for hands-free operation of Windows. Comes with a 120,000 word pronunciation dictionary; users can also add their own words or phrases. Dictate directly into any application.
* Rough Cost: Prices including software, documentation and microphone:
  + DragonDictate Starter Edition (5,000 words active) -- $395
  + DragonDictate Classic Edition (30,000 words active) -- $695
  + DragonDictate Power Edition (60,000 words active) -- $1,695
* Requirements: 486/33, 7-10 MB dedicated RAM (depending on edition), Windows 3.1 or later. Supported sound boards: Media Vision Pro Audio Studio 16, Creative Labs Sound Blaster 16, Microsoft Windows Sound System, IBM Audio Capture/Playback Adapter.
* Contact: Dragon Systems, Inc.
  320 Nevada Street, Newton, MA 02160, USA
  Phone: (617) 965-5200 Fax: (617) 527-0372

DragonVoiceTools

* Platform: PC
* Description: Programmer's toolkit for developing speech-aware DOS or Windows applications.
  Recognizes continuously spoken digits and discretely spoken words or phrases. Up to 1,000 words can be active at one time. Use words from the 110,000 word dictionary (included) and/or develop your own word models.
* Cost:
  + US$1,995 (developer's kit)
  + US$595 (end-user system)
* Requirements: Minimum of 20 MHz 386 (larger vocabulary requires faster processor) with at least 5M memory and at least 19M disk space (depending on vocabulary size), DOS 5.0 and up, Windows 3.1 and up, Borland C or C++ or Microsoft C or C++. Also requires IBM M-ACPA card available from IBM or Dragon Systems ($325).
* Contact: Dragon Systems, Inc.
  320 Nevada Street, Newton, MA 02160, USA
  Tel: 1-617-965-5200, Fax: 1-617-527-0372

IBM VoiceType Dictation
OR: Osborne Personal Dictation System (in Australia)

* Platform: Intel i486 & IBM OS/2
* Description: Speaker independent, discrete speech dictation with navigation. Navigation does not require setup; most applications are automatically speech enabled by dynamic control analysis. Dictation averages 70 WPM with 95% accuracy and uses statistical trigram modelling. The base system is 22K words; other vocabularies are available for specific industries.
* Requirements: 486SX or above, 16MB RAM, 30MB file space, Dictation Adapter
* Cost: Software $495 (includes mic) / Hardware $495
* Misc 1: A Windows version is now available.
* Misc 2: Based on IBM Tangora Technology
* Availability: US English. Other languages (UK, FR, GR, IT, and ES) available 3Q94.
* Contact: US Contact 1-800-TALK-2-ME or 1-914-766-9252.

VoiceServer for Windows

* Platform: PC
* Description: Speaker dependent, each speaker with an independent directory. Isolated word. Up to 1000 words/user, 300 words/window. 1 word occupies 2 KB on hard disk. Can be used to control Windows applications by issuing voice commands instead of menu selection.
* Rough Cost: 292 pounds (UK)
* Requirements: None
* Misc: Price includes a half-sized AT voice card (including a DSP), software, documentation & a microphone (attachable to keyboard or speaker). A light-weight high-spec headset is an optional extra.
* Contact: Mark Redwood
  Applied Voice Technologies
  26 Danbury Street, Islington, London, UK, N1 8JU
  Ph: +44 71 454 1224
  Fax: +44 71 454 1225

IN3 Voice Command for Windows

* Platform: PC with Windows 3.1
* Description: IN3 is now available for MS-Windows. Users can call applications to the foreground with voice commands. Once the application is called, the user may enter commands and data with voice commands. Voice macros can reduce the strain of repetitive stress injuries (RSI) such as Carpal Tunnel Syndrome (CTS) by replacing heavy repetitive keyboard hammering with simple voice operations. Voice macros take complex operations and reduce them to simple verbal commands. Voice input can provide new facilities for tasks which could not otherwise easily be performed without multiple axes of input. IN3 is hardware-independent: users with any Windows-compatible audio hardware can add speech recognition to the desktop. IN3 works with either 8 bit or 16 bit Windows audio boards. IN3 is based on continuous word-spotting technology. A developer API is also available for creating voice-enabled applications.
* Price: $179 U.S.
* Requirements: PC with 80386 processor or better, Microsoft Windows 3.1, and Windows compatible audio system with microphone.
* Misc: Fully functional demos are available on CompuServe in various Multimedia and CAD forums. Demos are also available from "America on Line", the comp.binaries.ms-windows archive sites, and various BBS systems. It is also available by anonymous ftp
  + ftp://ftp.wustl.edu/usenet/comp.binaries.ms-windows/v3/in3demo.zip
  + ftp://ftp.uwasa.fi/mirror/ultrasound/demo/in3demo.zip
  An equivalent Sun product is described below.
* Contact: Brantley Kelly
  Email: cbk@gacc.atl.ga.us
  CIS: 75120,431
  FAX: 1-404-925-7924
  Phone: 1-404-925-7950
  Command Corp. Inc, 3675 Crestwood Parkway, Duluth GA 30136, USA

IN3 Voice Command

* Platform: Sun SPARCstation
* Description: IN3 provides a secure, robust, word spotting, continuous speech recognition facility for the Sun OS or Solaris operating systems. The recognition system is a secure operating system facility capable of working with various interfaces, microphones, and devices. The operating system interface works with native UNIX outside of X Windows and also provides enhanced X Windows facilities including named window support. The user interface provides a means to quickly create commands on the fly for replacing long strings and complex operations with voice macros. [Voice macros can reduce the strain of repetitive stress injuries (RSI) such as Carpal Tunnel Syndrome (CTS) by replacing heavy repetitive keyboard hammering with simple voice operations.] The IN3 user interface works with generic X servers and window managers. A developer API is also available for creating voice-enabled applications, interfacing with other audio sources, and providing extensive application control over the recognition facility.
* Availability: SunSite archive at SunSITE.unc.edu as well as on Catalyst CDware as both a runnable demo and unlockable software.
* Hardware Required: Sun SPARCstation with audio input. Noise canceling microphone recommended but not required.
* Software Required:
  + Sun OS 4.1.2 with OpenWindows 3.0
  + or, Sun OS 4.1.3
  + or, Solaris 2.1 or Solaris 2.2
* Misc: An equivalent MS-Windows product is described above.
* Price: $495 U.S.
* Contact: Brantley Kelly
  Email: cbk@gacc.atl.ga.us
  CIS: 75120,431
  FAX: 1-404-925-7924
  Phone: 1-404-813-8030
  Command Corp. Inc, 3675 Crestwood Parkway, Duluth GA 30136, USA

Phonetic Engine 400 (PE400) - Speech Systems, Inc.

* Platform: PC
* Description: Speaker independent, large vocabulary, continuous speech recognition for MS Windows or DOS.
* Rough Cost: US$1195. Includes board, microphone, developer kit, documentation, 2 days of technical training and 90 days of technical support.
* Requirements: IBM AT class machine or better plus 5M disk space. Most processing is performed on-board (4M standard or 16M upgrade).
* Misc: Requires developer to provide a context-free grammar. Vocabulary size unknown (quotes from 500 - 2000 words per grammar), but dynamic grammar switching capabilities may increase the effective vocabulary size. Development system includes a lower-level C/C++ library (VoiceLib), a higher-level DLL (SPOT) callable from many languages, and SPOT/VBX, a custom control for Visual Basic and Visual C++.
* Contact: Speech Systems, Inc.
  2945 Center Green Court South, Boulder, CO 80301-2275, USA
  Tel: 303.938.1110
  Fax: 303.938.1874

SayIt

* Platform: Sun SPARCstation
* Description: Voice recognition and macro building package for Suns in the OpenWindows 3.0 environment. Speaker dependent discrete speech recognition. Vocabularies can be associated with applications, and the active vocabulary follows the application that has input focus. Macros can include mouse commands, keystrokes, Unix commands, sound, OpenWindows actions and more. An evaluation copy is available by email.
* Hardware: Microphone required (SunMicrophone is fine).
* Cost: $US295
* Contact: Phone: 1-800-245-UNIX or 1-415-572-0200
  Fax: 1-415-572-1300
  Email: info@qualix.com

Kurzweil Voice for Windows

* Platform: MS Windows 3.1
* Description: Kurzweil Voice for Windows is a dictation product enabling the user to create text and enter data by speaking to Windows-based applications. System is adaptive but requires no initial training. Users can choose either a 30,000 or 60,000 word active vocabulary. Application command translation templates are provided for popular Windows applications such as WordPerfect, 1-2-3, Organizer, Word.
* Cost: US $995
* Hardware: 486DX/33 or higher, 8 or 16 MB dedicated memory (depending on vocabulary), 30 MB dedicated disk space, VGA or higher, Kurzweil-supplied microphone and DSP board.
* Contact: Phone: 1-800-380-1234
  Email: info@kurz-ai.com

D6006 Voice Control Processor

* Platform: ?
* Description: ?
* Contact: DSP Telecommunications Inc.
  2855 Kifer Road, Suite 202, Santa Clara CA 95051, USA
  Tel: (408) 986-4310 Fax: (408) 986-4324

Speech Commander - Listen for Windows

* Platform: ?
* Description: ?
* Contact: Verbex Voice Systems
  1090 King Georges Post Rd., Bldg 107, Edison NJ 08837, USA
  Tel: (908) 225-5225 Fax: (908) 225-7764

Voice-Trek 2.0

* Platform: ?
* Description: ?
* Contact: Tardis Technology Inc., Voice Recognition Div.
  10321 Los Alamitos Blvd., Los Alamitos CA 90720
  Tel: (310) 799-3355 Fax: (310) 799-3360

Visus SpeechKit

* Platform: NeXT
* Description: SpeechKit is based on SPHINX, a speaker-independent, 1000 word or so, continuous speech recognition system which allows you to incorporate speech recognition into your applications. You can design your vocabulary and grammars.
* Contact: Visus - no address or phone provided. A possible contact is Robert Brennan at Carnegie Mellon University.
  email: Robert_Brennan@cmu.edu

recnet

* Platform: UNIX
* Description: Speech recognition for the speaker independent TIMIT and Resource Management tasks. It uses recurrent networks to estimate phone probabilities and Markov models to find the most probable sequence of phones or words. The system is a snapshot of evolving research code. There is no documentation other than published research papers. The components are:
  + A preprocessor which implements many standard and many non-standard front end processing techniques.
  + A recurrent net recogniser and parameter files
  + Two Markov model based recognisers, one for phone recognition and one for word recognition
  + A dynamic programming scoring package
  The complete system performs competitively.
* Cost: Free
* Requirements: TIMIT and Resource Management databases
* Contact: Tony Robinson: ajr@eng.cam.ac.uk
* Availability: by anonymous ftp
  + ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/misc/recnet-1.3.tar.Z

Lotec Speech Recognition Package

* Platform: Sun
* Description: Public domain speech recognition software. Operates from input in Sun audio format (.au files) and outputs word hypotheses and time labelling data. The software includes programs to collect speech samples, a labeller, a "featurizer" which parameterises speech files, a word spotter and the recogniser. The software can perform real time recognition on a Sparc 10 for small vocabularies.
* Requirements: Sun SPARC audio input and a "decent" microphone, Sun multimedia demo software (in /usr/demo/SOUND) and X.
* Availability: By anonymous ftp
  + ftp://ftp.sanpo.t.u-tokyo.ac.jp/pub/nigel/lotec/lotec.tar.Z
* Contact: Nigel Ward: nigel@sanpo.t.u-tokyo.ac.jp

Myers' Hidden Markov Model software

* Description: Hidden Markov model software for automatic speech recognition. C++ code that implements a basic left-right hidden Markov model and the corresponding Baum-Welch (ML) training algorithm. It is meant as an example of the HMM algorithms described by L. Rabiner and others. The code was built in order to learn how HMM systems work and we are now offering it to the net so that others can learn how to use HMMs for speech recognition. Keep in mind that ease of understanding was our primary concern, not efficiency. The code can be used to build experimental speech recognition systems using "train_hmm" and "test_hmm", and can be used in conjunction with written tutorials on HMMs to understand how they work.
* Availability: By anonymous ftp from the comp.speech archive site.
  There are three files in the directory
  + ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/sources
  The files are
  + hmm.README
  + hmm-1.0.tar.Z
  + OR, hmm-1.0.tar.gz
  (Note: hmm-1.0.tar.Z and hmm-1.0.tar.gz are compress and GNU gzip versions of the same files.)
* Contact: Richard Myers: email rmyers@ics.uci.edu

Voice Command Line Interface

* Platform: Amiga
* Description: VCLI will execute CLI commands, ARexx commands, or ARexx scripts by voice command through your audio digitizer. VCLI allows you to launch multiple applications or control any program with an ARexx capability entirely by spoken voice command. VCLI is fully multitasking and will run in the background, continuously listening for your voice commands even while other programs are running. Documentation is provided in AmigaGuide format. VCLI 6.0 runs under either AmigaDOS 2.0 or 3.0.
* Cost: Free?
* Requirements: Supports the DSS8, PerfectSound 3, Sound Master, Sound Magic, and Generic audio digitizers.
* Availability: by ftp from wuarchive.wustl.edu in the file systems/amiga/incoming/audio/VCLI60.lha and from amiga.physik.unizh.ch as the file pub/aminet/util/misc/VCLI60.lha
* Contact: Author's email is RHorne@cup.portal.com

DATAVOX - French

* Platform: PC
* Description: Continuous speech - speaker independent or dependent.
* Rough Cost: ?
* Requirements: 2 PC format boards (RdF1000 and TdS 96/25) and an A/D - D/A module (ASA116)
* Misc: Application software may communicate with DATAVOX through 2 types of interface:
  + Keyboard overlay: The application software may be used with any PC compatible package. No specific adaptation is necessary; you only need to define your configuration with the application software.
  + C library: Allows a user-written program to drive the recognition system.
  DATAVOX is based on the AMADEUS speech recognition software developed at LIMSI. It provides
  + Continuous speech recognition with 500 words speaker dependent, 50 words speaker independent (custom-made vocabulary).
  + Grammar of the application language (syntax acquisition, verification and simplification software).
  + Large vocabulary: DATAVOX can recognize vocabularies of several thousand words as long as there are no more than 500 words in the active vocabulary at any given node. It takes less than 1 second to change syntax and vocabulary.
  + Training controlled by the system (use of co-articulation models).
  + Response time less than 500 ms for any phrase length.
  + Synthesis (ADPCM) can be heard simultaneously while recognition is being carried out.
* Contact: VECSYS
  Le Chene rond, 91570 Bievres, France
  Fax: 33 1 69 41 24 30
  Voice: 33 1 69 41 15 04

PowerSecretary

* Platform: Centris 650, 660AV; Quadra 650, 660AV, 700, 800, 840AV, 900, 950.
* Description: Speaker dependent/adaptive system requiring words to be separated by short pauses.
* Vocabulary: 30,000 at any one time, automatically selected from a 120,000-word dictionary.
* Cost: US$2,495; non-AV machines need an audio board, which costs about US$300.
* Requirements: Minimum of 16M of RAM and System 7.0.
* Contact: Articulate Systems
  600 W. Cummings Park, Suite 4500, Woburn, MA 01801
  Ph: (617) 935-5656
  Fax: (617) 935-0490

ICSS system from IBM

* Description: A large vocabulary, speaker independent, continuous speech system which runs under Windows, OS/2, and AIX.
* Requirements: Soundboard (e.g. Soundblaster)
* Price: $US319
* Contact: A&G Graphics Interface
  ICSS Reseller
  51 Gore Street, Cambridge, MA, 02139, USA
  (617) 492-0120

Custom Voice(TM) by A&G Graphics Interface

* Description: Speech recognition custom control for Visual Basic, Visual C++, Borland C++, and other development platforms that support *.VBX. Provides a development platform that is independent of any particular (proprietary) recognition engine. Currently supports ICSS, but should soon support other platforms. Includes a grammar debugger and parser APIs to parse spoken speech into useful data types.
* Requirements: Visual Basic or any development platform that supports VBX.
* Price: $US495 or $695 bundled with ICSS.
* Contact: A&G Graphics Interface
  51 Gore Street, Cambridge, MA, 02139, USA
  (617) 492-0120

Creative VoiceAssist

* Platform: PC (?)
* Price: $US99.95
* Contact: Creative Labs
  Ph: 1-800-998-5227
_________________________________________________________________

Andrew Hunt --- Speech Technology Research Group
Dept. of Electrical Engineering
University of Sydney, NSW, 2006, Australia
Ph: 61-2-351 4509
Fax: 61-2-351 3847
email: andrewh@speech.su.oz.au